
support un-mangle identifiers #34

Open
3052 opened this issue Nov 5, 2023 · 17 comments
Labels
enhancement New feature or request

Comments


3052 commented Nov 5, 2023

using this code:

const c=n.a.get("API.lemonade.url"),d=n.a.get("API.lemonade.urlLinear"),u=n.a.get("API.lemonade.urlVod"),p=n.a.get("API.lemonade.platform"),m=n.a.get("API.lemonade.timeout");

I get this result:

const c = n.a.get("API.lemonade.url");
const d = n.a.get("API.lemonade.urlLinear");
const u = n.a.get("API.lemonade.urlVod");
const p = n.a.get("API.lemonade.platform");
const m = n.a.get("API.lemonade.timeout");

using another tool, I get:

const const1_ = n.a.get("API.lemonade.url"),
  const2_ = n.a.get("API.lemonade.urlLinear"),
  const3_ = n.a.get("API.lemonade.urlVod"),
  const4_ = n.a.get("API.lemonade.platform"),
  const5_ = /* global */ n.a.get("API.lemonade.timeout");

https://richsnapp.com/tools/code-unmangler

using another tool, I get:

const varFastenedWheel = n.a.get('API.lemonade.url'), varHairStepped = n.a.get('API.lemonade.urlLinear'), varFourRising = n.a.get('API.lemonade.urlVod'), varCourageYouth = n.a.get('API.lemonade.platform'), varBellRain = n.a.get('API.lemonade.timeout');

https://github.com/relative/synchrony


pionxzh commented Nov 5, 2023

Hi, we don't have an un-mangle feature yet. You can use other tools for the first pass and let wakaru handle the syntax-related unminification. I will convert this issue into a feature request so that we can track its progress here. 🙏

@pionxzh pionxzh changed the title rename identifiers support un-mangle identifiers Nov 5, 2023
@pionxzh pionxzh added the enhancement New feature or request label Nov 5, 2023

pionxzh commented Nov 5, 2023

For now, we have smart-rename that can guess the variable name based on the context. I would like to expand it to cover some other generic cases.


0xdevalias commented Nov 13, 2023

I just finished up writing some thoughts/references for variable renaming on the webcrack repo, that could also be a useful idea for here. (see quotes below)

When I was exploring PoC ideas for my own project previously, I was looking to generate a file similar to the 'module map' that this project is using; but instead of covering just the names of modules, I wanted to use it to provide a 'variable name map'. Though because the specific variables used in webpack/etc can change between builds, my thought was that 'normalising' them to a 'known format' based on their context would make sense to do first.

That could then be later enhanced/expanded by being able to pre-process these 'variable name mappings' for various open source projects in a way that could then be applied 'automagically' without the end user needing to first create them.

It could also be enhanced by similar techniques such as what the humanify project does, by using LLMs/similar to generate suggested variable name mappings based on the code.

My personal ideal end goal for a feature like that would then allow me to use it within an IDE-like environment, where I can rename variables 'as I explore', knowing that the mappings/etc will be kept up to date.


When I was exploring this concept in my own deobfuscation PoC project, I was looking at making the variable names unique and having them carry some semantic information about their source/scope.

Eg. if it was an arg to a function, it might be arg_1. Or potentially if the function is foo, it might end up as foo_arg_1
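As a toy illustration of that naming idea (the helper and name format here are hypothetical, invented for this sketch rather than taken from any of the tools mentioned):

```javascript
// Hypothetical helper illustrating the idea above: an argument's name is
// derived from its position, optionally qualified by the enclosing
// function's name when one is known.
function semanticArgName(fnName, argIndex) {
  const base = `arg_${argIndex + 1}`;
  return fnName ? `${fnName}_${base}` : base;
}

console.log(semanticArgName(null, 0)); // "arg_1"
console.log(semanticArgName("foo", 0)); // "foo_arg_1"
```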

It looks like most of the PoC code I was playing with was local and in a pretty messy/hacky state, but I did find a link to an online REPL where I was playing around with some of it. Not sure how outdated that code is, but it might be useful:

There were a number of different AST parsers I was playing around with, but I think that this babel code may have been the latest (not sure which one):

Within those files, I believe the relevant functions are getNameFromPath and getPrefix (and the older commented-out functions getTypePrefix and getPrefix).


Edit: Came across this in another issue here:

I published my decompiler that I used in the above example. I think it might be a good reference for adding this feature.
https://github.com/e9x/krunker-decompiler

Originally posted by @e9x in j4k0xb/webcrack#10 (comment)

And looking at its libRenameVars code, it seems to take a vaguely similar approach to how I was looking at doing things in my original PoC, as described above:

  • https://github.com/e9x/krunker-decompiler/blob/master/src/libRenameVars.ts
    • getVarPrefix will set a prefix based on the type (eg. func, arg, Class, imported, var)
    • getName generates a new variable name that does not conflict with existing names or reserved keywords
    • generateName generates a new name for a variable considering its scope, type, and the context in which it is used (e.g., whether it's a class, a function variable, etc.).
      It employs various AST manipulations to ensure the generated name is appropriate and does not conflict with existing names.

A more generalised summary/overview (via ChatGPT):

The code implements a sophisticated algorithm for renaming variables in a JavaScript program, adhering to several high-level rules and strategies:

  1. Type-Specific Prefixing:

    • The getVarPrefix function assigns specific prefixes to variable names based on their type (e.g., "func" for function names, "arg" for parameters). This approach helps in identifying the role of a variable just by its name.
  2. Avoiding Reserved Keywords:

    • The script includes a comprehensive list of reserved JavaScript keywords. If a variable's name matches a reserved keyword, it is prefixed with an underscore to prevent syntax errors.
  3. Unique Naming with Context Consideration:

    • The generateName function ensures that each variable gets a unique name that doesn't conflict with other variables in its scope. It also considers the context in which a variable is used. For example, if a variable is part of a class, it may receive a name that reflects this context, using pascalCase or camelCase as appropriate.
  4. Handling Special Cases:

    • The script contains logic to handle special cases, such as variables that are function expressions (isFuncVar) or class instances (isClass). This affects the naming convention applied to these variables.
  5. Randomness with Mersenne Twister:

    • A Mersenne Twister is used to generate random elements for variable names, ensuring that the names are not only unique within the scope of the program but also less predictable.
  6. AST-Based Renaming:

    • The script analyzes the Abstract Syntax Tree (AST) of the program to understand the structure and scope of variables. This analysis guides the renaming process, ensuring that the new names are consistent with the variable's usage and position in the code.
  7. Scope Analysis with ESLint Scope:

    • By leveraging eslint-scope, the script can accurately determine the scope of each variable. This is crucial in avoiding name collisions and ensuring that the renaming respects lexical scoping rules in JavaScript.
  8. Consideration for Exported and Assigned Variables:

    • The script pays special attention to variables that are exported or assigned in specific ways (e.g., through Object.defineProperty). It ensures that these variables receive names that are appropriate for their roles.

In summary, the script uses a combination of type-based naming conventions, context consideration, randomness, AST analysis, and scope analysis to systematically rename variables in a JavaScript program. This approach aims to enhance readability, avoid conflicts, and maintain the logical structure of the program.

Originally posted by @0xdevalias in j4k0xb/webcrack#21 (comment)
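A rough sketch of rules 1 and 2 from that summary might look like this (this is illustrative only, not the actual libRenameVars implementation; the kind names and keyword list are assumptions):

```javascript
// Illustrative sketch (not the actual libRenameVars code) of rules 1-2 above:
// a type-specific prefix identifies the binding's role, and names that would
// collide with a reserved keyword get an underscore prefix.
const RESERVED = new Set(["if", "in", "do", "for", "var", "new", "class", "return"]);

function getVarPrefix(kind) {
  // kind set assumed from the description: func, arg, Class, imported, var
  switch (kind) {
    case "function": return "func";
    case "parameter": return "arg";
    case "class": return "Class";
    case "import": return "imported";
    default: return "var";
  }
}

function safeName(candidate) {
  // escape names that happen to match a reserved keyword
  return RESERVED.has(candidate) ? `_${candidate}` : candidate;
}

console.log(safeName(`${getVarPrefix("parameter")}1`)); // "arg1"
console.log(safeName("class")); // "_class"
```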


And for an even cooler/more extreme version of improving variable naming; I just came across this blog post / project from @jehna that makes use of webcrack + ChatGPT for variable renaming:

  • https://thejunkland.com/blog/using-llms-to-reverse-javascript-minification.html
    • Using LLMs to reverse JavaScript variable name minification
      This blog introduces a novel way to reverse minified Javascript using large language models (LLMs) like ChatGPT and llama2 while keeping the code semantically intact. The code is open source and available at Github project Humanify.

  • https://github.com/jehna/humanify
    • Un-minify Javascript code using ChatGPT

    • This tool uses large language models (like ChatGPT & llama2) and other tools to un-minify Javascript code. Note that LLMs don't perform any structural changes – they only provide hints to rename variables and functions. The heavy lifting is done by Babel on AST level to ensure code stays 1-1 equivalent.

Originally posted by @0xdevalias in j4k0xb/webcrack#21 (comment)


0xdevalias commented Nov 20, 2023

For now, we have smart-rename that can guess the variable name based on the context. I would like to expand it to cover some other generic cases.

Linking to my smart-rename related issues to keep the contextual link here:


0xdevalias commented Nov 22, 2023

Another link from my reference notes that I forgot to include earlier; my thoughts on how to rename otherwise unknown variables are based on similar concepts that are used in reverse engineering tools such as IDA:

  • https://hex-rays.com/blog/igors-tip-of-the-week-34-dummy-names/
    • In IDA’s disassembly, you may have often observed names that may look strange and cryptic on first sight: sub_73906D75, loc_40721B, off_40A27C and more. In IDA’s terminology, they’re called dummy names. They are used when a name is required by the assembly syntax but there is nothing suitable available

    • https://www.hex-rays.com/products/ida/support/idadoc/609.shtml
      • IDA Help: Names Representation

      • Dummy names are automatically generated by IDA. They are used to denote subroutines, program locations and data. Dummy names have various prefixes depending on the item type and value

Originally posted by @0xdevalias in j4k0xb/webcrack#21 (comment)


And a few more that I was looking at recently as well (these are sort of basically smart-rename):

  • https://binary.ninja/2023/09/15/3.5-expanded-universe.html#automatic-variable-naming
    • Automatic Variable Naming
      One easy way to improve decompilation output is to come up with better default names for variables. There’s a lot of possible defaults you could choose and a number of different strategies are seen throughout different reverse engineering tools. Prior to 3.5, Binary Ninja left variables named based on their origin. Stack variables were var_OFFSET, register-based variables were reg_COUNTER, and global data variables were (data_). While this scheme isn’t changing, we’re being much more intelligent about situations where additional information is available.

      For example, if a variable is passed to a function and a variable name is available, we can now make a much better guess for the variable name. This is most obvious in binaries with type libraries.

    • This isn’t the only style of default names. Binary Ninja also will name loop counters with simpler names like i, or j, k, etc (in the case of nested loops)

  • Variable name propagation Vector35/binaryninja-api#2558


0xdevalias commented Dec 4, 2023

Was looking closer at the sourcemap spec today, and the names field jumped out at me as potentially useful:

  • https://tc39.es/source-map-spec/#names
    • names: a list of symbol names used by the mappings entry

  • https://tc39.es/source-map-spec/#mappings
    • mappings: a string with the encoded mapping data (see 4.1 Mappings Structure)

  • https://tc39.es/source-map-spec/#mappings-structure
    • The mappings data is broken down as follows:

      • each group representing a line in the generated file is separated by a semicolon (;)
      • each segment is separated by a comma (,)
      • each segment is made up of 1, 4, or 5 variable length fields.
    • It then goes on to describe the segments in greater detail, but the specific part I was thinking could be relevant here would be this:
      • If present, the zero-based index into the names list associated with this segment. This field is a base 64 VLQ relative to the previous occurrence of this field unless this is the first occurrence of this field, in which case the whole value is represented.
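To make the encoding concrete, here is a minimal sketch of decoding one mappings segment into its base64 VLQ fields (assuming the standard base64 alphabet from the spec; this is not the source-map library):

```javascript
// Minimal sketch of decoding one sourcemap "mappings" segment into its
// variable-length base64 VLQ fields (per the structure described above).
const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

function decodeVlqSegment(segment) {
  const fields = [];
  let value = 0;
  let shift = 0;
  for (const ch of segment) {
    const digit = B64.indexOf(ch);
    if (digit === -1) throw new Error(`invalid base64 character: ${ch}`);
    value += (digit & 0b011111) << shift; // low 5 bits carry data
    if (digit & 0b100000) {
      shift += 5; // continuation bit set: more digits follow
    } else {
      const sign = value & 1; // low bit of the assembled value is the sign
      fields.push(sign ? -(value >>> 1) : value >>> 1);
      value = 0;
      shift = 0;
    }
  }
  return fields;
}

// A 5-field segment: the fifth field is the (delta-encoded) index into `names`.
console.log(decodeVlqSegment("AAAAC")); // [ 0, 0, 0, 0, 1 ]
```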

Obviously if there is a full sourcemap for the webapp, then wakaru isn't really needed anyway... but what I was thinking of here is that, in combination with module detection (see #41), if there are sourcemaps available for the original module, then we could potentially extract the original function/variable/etc names from the names field of the sourcemap, and use them in a sort of 'smart-rename with sourcemap' way.


Another sourcemap-related idea I had (which probably deserves its own issue) is that it would be cool to be able to 'retroactively generate a sourcemap' for a webapp, based on the unminified output from wakaru; such that we could then take that sourcemap and apply it to the original minified web app source for debugging the live app.

Edit: Created a new issue to track this:


pionxzh commented Dec 4, 2023

I have considered this before. The conclusion was that Wakaru is built for code that ships without a sourcemap; it isn't very meaningful to support such a feature when you can already access all the source code.

@0xdevalias

It isn't very meaningful to support such a feature when you can access all the source code.

@pionxzh I was specifically talking about it in terms of bundled modules (eg. React, etc), and not the unique web app code of the app itself.


pionxzh commented Dec 5, 2023

You mean that, for popular open-source projects, we could ship some sourcemaps in our project / read them from the chunk, and then reverse-map the minified variable and function names back to the originals?


StringKe commented Dec 5, 2023

A configuration table/profile could be provided to allow users to manually write the correspondences. wakaru could simply ship with rules for the better-known packages.


pionxzh commented Dec 6, 2023

@StringKe Can you specify the content that you would expect to have, and the corresponding behavior? 🙏


0xdevalias commented Dec 6, 2023

You mean like, for popular open-source projects, we can put some sourcemap in our project / read from the chunk, and then reverse map the minified variable and function name back to normal?

@pionxzh Similar to that, but probably not "put the sourcemap in our project" directly; rather, process the sourcemaps from popular open-source projects and extract those details into an 'intermediary form'. That 'intermediary form' would be similar to the 'module map' file, as I described earlier in this thread:

When I was exploring PoC ideas for my own project previously, I was looking to generate a file similar to the 'module map' that this project is using; but instead of covering just the names of modules, I wanted to use it to provide a 'variable name map'. Though because the specific variables used in webpack/etc can change between builds, my thought was that 'normalising' them to a 'known format' based on their context would make sense to do first.

That could then be later enhanced/expanded by being able to pre-process these 'variable name mappings' for various open source projects in a way that could then be applied 'automagically' without the end user needing to first create them.

It could also be enhanced by similar techniques such as what the humanify project does, by using LLMs/similar to generate suggested variable name mappings based on the code.

Originally posted by @0xdevalias in #34 (comment)


A configuration table/profile can be provided to allow users to manually write correspondences. wakaru can simply include the rules of the better known packages.

@StringKe nods, sounds like we are thinking about similar things here :)


Can you specify the content that you would expect to have, and the corresponding behavior?

@pionxzh For me personally, I haven't deeply thought through all the use cases in depth, but at a high level I basically want to be able to take a web app that is going to be re-built multiple times, and be able to have a 'config file' similar to the 'module mapping' that wakaru has/had; but that also allows me to specify the variable/function names ('symbols') that are used within it.

The slightly more challenging part is that because the app will be re-built multiple times, the minified variables will change (sometimes every build), so we can't easily use those as the 'key' of the mapping. One idea I had for solving that is potentially by first renaming all of the variables based on a 'stable naming pattern' (eg. func_*, arg_*, const_*, etc; and then could just use a counter/similar based on the 'scope' it's being defined in) that would be generated based on the scope/type of the 'symbol', and would therefore be resilient to the minified variable names changing each build. Those 'stable intermediary names' could then potentially be used for the keys in the variable mapping.
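A minimal sketch of that 'stable naming pattern' idea (the kinds, name format, and scope IDs here are assumptions from this discussion, not anything wakaru implements):

```javascript
// Sketch of the 'stable naming pattern': names are derived from the binding's
// kind plus a per-scope counter, so they stay stable even when the minified
// identifiers change between builds.
function makeStableNamer() {
  const counters = new Map(); // "scopeId:kind" -> last counter issued
  return function stableName(scopeId, kind) {
    const key = `${scopeId}:${kind}`;
    const n = (counters.get(key) ?? 0) + 1;
    counters.set(key, n);
    return `${kind}_${n}`;
  };
}

const name = makeStableNamer();
console.log(name("module#0", "const")); // "const_1"
console.log(name("module#0", "const")); // "const_2"
console.log(name("module#0", "arg"));   // "arg_1"  (counter is per kind)
console.log(name("module#1", "const")); // "const_1" (and per scope)
```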

Though then we also need to figure out what level of 'granularity' makes sense to generate those 'stable intermediary names' at; as having a 1:1 mapping of those 'stable name scopes' to JS scopes could potentially end up being really noisy in the mapping file. So maybe using a 'higher abstracted scope' would make more sense (eg. at the module level or similar)

My original hacky implementation of this in my own PoC code was using JS objects/JSON to map an explicit minified variable name to its 'proper' name; but that broke because the minified names changed between builds. Even by implementing the 'stable naming pattern', if those 'stable names' included a 'counter' in them (eg. func_1, const_8, etc) we still probably wouldn't want to use those stable names directly as the keys of an object, since if a new variable was added 'in between' in a later build, that would 'shift' the 'counter' for every matching variable afterwards, which would be a lot of effort to manually update in a mapping file. While I haven't thought too deeply about it, I think that by using an array in the mapping file, we would only need a small change to 'fix the mappings' when a new variable is added that 'shifts' everything.
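A sketch of how that array form could be applied (the mapping values and the `const_*` name format are invented for illustration):

```javascript
// Sketch of the array-based mapping-file idea: because stable names embed a
// counter, an array keyed by position means that when a new variable shifts
// every counter, inserting one entry fixes the whole file.
const constMappings = ["lemonadeUrl", "lemonadeUrlLinear", "lemonadeUrlVod"];

function applyMapping(stableName, mappings) {
  const match = /^const_(\d+)$/.exec(stableName);
  if (!match) return stableName; // not a mapped kind, keep as-is
  const index = Number(match[1]) - 1; // const_1 -> mappings[0]
  return mappings[index] ?? stableName; // fall back if unmapped
}

console.log(applyMapping("const_2", constMappings)); // "lemonadeUrlLinear"
console.log(applyMapping("const_9", constMappings)); // "const_9" (no entry)
```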

Even by using the array concept in the mappings file, there is still some manual pain/effort involved in trying to keep the mapping up to date in newer builds. That's what led me into some of the deeper/more esoteric ideas/thinking around 'fingerprinting' that I expand on below.

--

(Edit: I have created a new issue for the AST fingerprinting stuff described below, see #74)

Another area I started looking into (but haven't deeply explored yet), both for figuring out how to map variable names to sections of code in a 'smart' way and potentially also for module identification (see #41), is the space of 'structural AST fingerprinting' or 'code similarity' algorithms and similar. (I realise that this is a rather deep/esoteric angle to be looking at this from, and that there are likely far simpler/easier ways to implement the variable mapping/module identification in a 'good enough' way without going to this level of depth; but I'm curious to explore it regardless, to see if any good ideas come out of it)

I haven't gotten too far in my reading yet (got distracted on other things), but the high level of my idea was that maybe we could generate an 'AST fingerprint' that isn't impacted by the variable/function/etc names ('symbols') changing during minification; and then use that as the basis for the 'key' in the 'mappings file'; as that fingerprint could theoretically still identify a 'scope' (which might be a literal JS scope, or might be a higher level abstraction that we decide makes sense; the most abstract being probably at the bundled module level) even if the bundler decides to move some functions around to a different module/etc. Then obviously if we were able to generate those 'resilient fingerprints' to identify code even when it's been minified, that would make perfect sense to apply to module detection/etc (see #41) as well.
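A toy sketch of what a name-insensitive structural fingerprint could look like (the mini-AST format here is invented for illustration; a real version would walk Babel/ESTree nodes and likely hash the serialized shape):

```javascript
// Toy 'structural AST fingerprint': only node types and tree shape
// contribute, never identifier names or literal values, so renames
// introduced by minification do not change the fingerprint.
function shapeOf(node) {
  const children = (node.children ?? []).map(shapeOf).join(",");
  return `${node.type}(${children})`; // names/values deliberately omitted
}

// Two builds of the same code, differing only in identifier names,
// produce an identical structural shape:
const buildA = {
  type: "VariableDeclaration", // e.g. const c = n.a.get(...)
  children: [{ type: "CallExpression", children: [{ type: "MemberExpression", children: [] }] }],
};
const buildB = {
  type: "VariableDeclaration", // e.g. const const1_ = x.y.get(...)
  children: [{ type: "CallExpression", children: [{ type: "MemberExpression", children: [] }] }],
};

console.log(shapeOf(buildA) === shapeOf(buildB)); // true
```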

Some of the high level ideas / search terms that I was using to start my research in that area was things like:

  • AST fingerprinting
  • Source code similarity fingerprinting
  • Control flow graphs
  • Call flow graphs
  • Program dependence graph
  • etc

Here is a link dump of a bunch of the tabs I have open but haven't got around to reviewing in depth yet, RE: 'AST fingerprinting' / Code Similarity / etc:

Unsorted/Unreviewed Initial Link Dump RE: 'AST fingerprinting' / Code Similarity
  • https://en.wikipedia.org/wiki/Program_dependence_graph
    • Program Dependence Graph - Wikipedia

    • In computer science, a Program Dependence Graph (PDG) is a representation of a program's control and data dependencies. It's a directed graph where nodes represent program statements, and edges represent dependencies between these statements. PDGs are useful in various program analysis tasks, including optimizations, debugging, and understanding program behavior.

  • https://en.wikipedia.org/wiki/Control-flow_graph
    • Control-Flow Graph - Wikipedia

    • In computer science, a control-flow graph (CFG) is a representation, using graph notation, of all paths that might be traversed through a program during its execution.

    • In a control-flow graph each node in the graph represents a basic block, i.e. a straight-line piece of code without any jumps or jump targets; jump targets start a block, and jumps end a block. Directed edges are used to represent jumps in the control flow. There are, in most presentations, two specially designated blocks: the entry block, through which control enters into the flow graph, and the exit block, through which all control flow leaves.

  • https://stackoverflow.com/questions/7283702/assembly-level-function-fingerprint
    • Stack Overflow: Assembly-level function fingerprint (2011)

  • https://stackoverflow.com/questions/15087195/data-flow-graph-construction
    • Stack Overflow: Data Flow Graph Construction (2013)

  • https://codereview.stackexchange.com/questions/276387/call-flow-graph-from-python-abstract-syntax-tree
    • Code Review Stack Exchange: Call-flow graph from Python abstract syntax tree (2022)

  • https://codeql.github.com/docs/writing-codeql-queries/about-data-flow-analysis/
    • CodeQL Documentation: About data flow analysis

    • Data flow analysis is used to compute the possible values that a variable can hold at various points in a program, determining how those values propagate through the program and where they are used.

  • https://clang.llvm.org/docs/DataFlowAnalysisIntro.html
    • Clang Documentation: Data flow analysis: an informal introduction

    • This document introduces data flow analysis in an informal way. The goal is to give the reader an intuitive understanding of how it works, and show how it applies to a range of refactoring and bug finding problems.

    • Data flow analysis is a static analysis technique that proves facts about a program or its fragment. It can make conclusions about all paths through the program, while taking control flow into account and scaling to large programs. The basic idea is propagating facts about the program through the edges of the control flow graph (CFG) until a fixpoint is reached.

  • https://openreview.net/forum?id=BJxWx0NYPr
    • Adaptive Structural Fingerprints for Graph Attention Networks (2019)

    • Graph attention network (GAT) is a promising framework to perform convolution and massage passing on graphs. Yet, how to fully exploit rich structural information in the attention mechanism remains a challenge. In the current version, GAT calculates attention scores mainly using node features and among one-hop neighbors, while increasing the attention range to higher-order neighbors can negatively affect its performance, reflecting the over-smoothing risk of GAT (or graph neural networks in general), and the ineffectiveness in exploiting graph structural details. In this paper, we propose an "adaptive structural fingerprint" (ADSF) model to fully exploit graph topological details in graph attention network. The key idea is to contextualize each node with a weighted, learnable receptive field encoding rich and diverse local graph structures. By doing this, structural interactions between the nodes can be inferred accurately, thus significantly improving subsequent attention layer as well as the convergence of learning. Furthermore, our model provides a useful platform for different subspaces of node features and various scales of graph structures to 'cross-talk' with each other through the learning of multi-head attention, being particularly useful in handling complex real-world data. Empirical results demonstrate the power of our approach in exploiting rich structural information in GAT and in alleviating the intrinsic oversmoothing problem in graph neural networks.

  • https://dl.acm.org/doi/10.1145/3486860
    • A Survey of Binary Code Fingerprinting Approaches: Taxonomy, Methodologies, and Features (2022)

    • Binary code fingerprinting is crucial in many security applications. Examples include malware detection, software infringement, vulnerability analysis, and digital forensics. It is also useful for security researchers and reverse engineers since it enables high fidelity reasoning about the binary code such as revealing the functionality, authorship, libraries used, and vulnerabilities. Numerous studies have investigated binary code with the goal of extracting fingerprints that can illuminate the semantics of a target application. However, extracting fingerprints is a challenging task since a substantial amount of significant information will be lost during compilation, notably, variable and function naming, the original data and control flow structures, comments, semantic information, and the code layout. This article provides the first systematic review of existing binary code fingerprinting approaches and the contexts in which they are used. In addition, it discusses the applications that rely on binary code fingerprints, the information that can be captured during the fingerprinting process, and the approaches used and their implementations. It also addresses limitations and open questions related to the fingerprinting process and proposes future directions.

  • https://inria.hal.science/hal-01648996/document
    • BinSign: Fingerprinting Binary Functions to Support Automated Analysis of Code Executables (2017)

    • Binary code fingerprinting is a challenging problem that requires an in-depth analysis of binary components for deriving identifiable signatures. Fingerprints are useful in automating reverse engineering tasks including clone detection, library identification, authorship attribution, cyber forensics, patch analysis, malware clustering, binary auditing, etc. In this paper, we present BinSign, a binary function fingerprinting framework. The main objective of BinSign is providing an accurate and scalable solution to binary code fingerprinting by computing and matching structural and syntactic code profiles for disassemblies. We describe our methodology and evaluate its performance in several use cases, including function reuse, malware analysis, and indexing scalability. Additionally, we emphasize the scalability aspect of BinSign. We perform experiments on a database of 6 million functions. The indexing process requires an average time of 0.0072 seconds per function. We find that BinSign achieves higher accuracy compared to existing tools.

  • https://hal.science/hal-00627811/document
    • Syntax tree fingerprinting: a foundation for source code similarity detection (2011)

    • Plagiarism detection and clone refactoring in software depend on one common concern: finding similar source chunks across large repositories. However, since code duplication in software is often the result of copy-paste behaviors, only minor modifications are expected between shared codes. On the contrary, in a plagiarism detection context, edits are more extensive and exact matching strategies show their limits.
      Among the three main representations used by source code similarity detection tools, namely the linear token sequences, the Abstract Syntax Tree (AST) and the Program Dependency Graph (PDG), we believe that the AST could efficiently support the program analysis and transformations required for the advanced similarity detection process.
      In this paper we present a simple and scalable architecture based on syntax tree fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.

  • https://ieeexplore.ieee.org/document/5090050
    • Syntax tree fingerprinting for source code similarity detection (2009)

    • Numerous approaches based on metrics, token sequence pattern-matching, abstract syntax tree (AST) or program dependency graph (PDG) analysis have already been proposed to highlight similarities in source code: in this paper we present a simple and scalable architecture based on AST fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.

    • https://igm.univ-mlv.fr/~chilowi/research/syntax_tree_fingerprinting/syntax_tree_fingerprinting_ICPC09.pdf
  • https://ieeexplore.ieee.org/document/9960266
    • Source Code Plagiarism Detection Based on Abstract Syntax Tree Fingerprintings (2022)

    • Syntax Tree (AST) is an abstract logical structure of source code represented as a tree. This research utilizes information of fingerprinting with AST to locate the similarities between source codes. The proposed method can detect plagiarism in source codes using the number of duplicated logical structures. The structural information of program is stored in the fingerprints format. Then, the fingerprints of source codes are compared to identify number of similar nodes. The final output is calculated from number of similar nodes known as similarities scores. The result shows that the proposed method accurately captures the common modification techniques from basic to advance.

  • https://digitalcommons.calpoly.edu/theses/2040/
    • Cloneless: Code Clone Detection via Program Dependence Graphs with Relaxed Constraints (2019)

    • Code clones are pieces of code that have the same functionality. While some clones may structurally match one another, others may look drastically different. The inclusion of code clones clutters a code base, leading to increased costs through maintenance. Duplicate code is introduced through a variety of means, such as copy-pasting, code generated by tools, or developers unintentionally writing similar pieces of code. While manual clone identification may be more accurate than automated detection, it is infeasible due to the extensive size of many code bases. Software code clone detection methods have differing degree of success based on the analysis performed. This thesis outlines a method of detecting clones using a program dependence graph and subgraph isomorphism to identify similar subgraphs, ultimately illuminating clones. The project imposes few constraints when comparing code segments to potentially reveal more clones.

    • https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=3437&context=theses
  • https://dl.acm.org/doi/10.1145/1286821.1286826
    • Dynamic graph-based software fingerprinting (2007)

    • Fingerprinting embeds a secret message into a cover message. In media fingerprinting, the secret is usually a copyright notice and the cover a digital image. Fingerprinting an object discourages intellectual property theft, or when such theft has occurred, allows us to prove ownership.

      The Software Fingerprinting problem can be described as follows. Embed a structure W into a program P such that: W can be reliably located and extracted from P even after P has been subjected to code transformations such as translation, optimization and obfuscation; W is stealthy; W has a high data rate; embedding W into P does not adversely affect the performance of P; and W has a mathematical property that allows us to argue that its presence in P is the result of deliberate actions.

      In this article, we describe a software fingerprinting technique in which a dynamic graph fingerprint is stored in the execution state of a program. Because of the hardness of pointer alias analysis such fingerprints are difficult to attack automatically.

    • https://dl.acm.org/doi/pdf/10.1145/1286821.1286826
  • https://patents.google.com/patent/US9459861B1/en
    • Systems and methods for detecting copied computer code using fingerprints (2016)

    • Systems and methods of detecting copying of computer code or portions of computer code involve generating unique fingerprints from compiled computer binaries. The unique fingerprints are simplified representations of functions in the compiled computer binaries and are compared with each other to identify similarities between functions in the respective compiled computer binaries. Copying can be detected when there are sufficient similarities between fingerprints of two functions.

  • https://www.unomaha.edu/college-of-information-science-and-technology/research-labs/_files/software-nsf.pdf
    • Software Fingerprinting in LLVM (2021)

    • Executable steganography, the hiding of software machine code inside of a larger program, is a potential approach to introduce new software protection constructs such as watermarks or fingerprints. Software fingerprinting is, therefore, a process similar to steganography, hiding data within other data. The goal of fingerprinting is to hide a unique secret message, such as a serial number, into copies of an executable program in order to provide proof of ownership of that program. Fingerprints are a special case of watermarks, with the difference being that each fingerprint is unique to each copy of a program. Traditionally, researchers describe four aims that a software fingerprint should achieve: the fingerprint should be difficult to remove, it should not be obvious, it should have a low false positive rate, and it should have negligible impact on performance. In this research, we propose to extend these objectives and introduce a fifth aim: that software fingerprints should be machine independent. As a result, the same fingerprinting method can be used regardless of the architecture used to execute the program. Hence, this paper presents an approach towards the realization of machine-independent fingerprinting of executable programs. We make use of Low-Level Virtual Machine (LLVM) intermediate representation during the software compilation process to demonstrate both a simple static fingerprinting method as well as a dynamic method, which displays our aim of hardware independent fingerprinting. The research contribution includes a realization of the approach using the LLVM infrastructure and provides a proof of concept for both simple static and dynamic watermarks that are architecture neutral.

  • https://www.computer.org/csdl/journal/ts/2023/08/10125077/1Nc4Vd4vb7W
    • Graph-of-Code: Semantic Clone Detection Using Graph Fingerprints (2023)

    • The code clone detection issue has been researched using a number of explicit factors based on tokens and contents, with effective results. However, exposing code contents may be an impractical option because of privacy and security factors. Moreover, the lack of scalability of past methods is an important challenge. The code flow states can be inferred from code structure and implicitly represented using empirical graphs. The assumption is that the code clone detection problem can be modelled without the content of the codes being revealed. Here, a Graph-of-Code concept for the code clone detection problem is introduced, which represents codes as graphs. While Graph-of-Code provides structural properties and quantification of its characteristics, it can exclude code contents or tokens to identify the clone type. The aim is to evaluate the impact of Graph-of-Code structural properties on the performance of code clone detection. This work employs a feature-extraction-based approach for unlabelled graphs. The approach generates a "Graph Fingerprint" which represents different topological feature levels. The results of code clone detection indicate that code structure has a significant role in detecting clone types. We found that different GoC models outperform others. The models achieve between 96% and 99% in detecting code clones based on recall, precision, and F1-Score. The GoC approach is capable of detecting code clones on scalable datasets while preserving code privacy.

  • https://www.cs.columbia.edu/~suman/secure_sw_devel/Basic_Program_Analysis_CF.pdf
    • Slides: Basic Program Analysis - Suman Jana

    • ChatGPT Summary / Abstract:
      • Title: Basic Program Analysis

        Author: Suman Jana

        Institution: Columbia University

        Abstract:
        This document delves into the foundational concepts and techniques involved in program analysis, particularly focusing on control flow and data flow analysis essential for identifying security bugs in source code. The objective is to equip readers with the understanding and tools needed to effectively analyze programs without building systems from scratch, utilizing existing frameworks such as LLVM for customization and enhancement of analysis processes.

        The core discussion includes an overview of compiler design with specific emphasis on the Abstract Syntax Tree (AST), Control Flow Graph (CFG), and Data Flow Analysis. These elements are critical in understanding the structure of source code and its execution flow. The document highlights the conversion of source code into AST and subsequently into CFG, where data flow analysis can be applied to optimize code and identify potential security vulnerabilities.

        Additionally, the paper explores more complex topics like identifying basic blocks within CFG, constructing CFG from basic blocks, and advanced concepts such as loop identification and the concept of dominators in control flow. It also addresses the challenges and solutions related to handling irreducible Control Flow Graphs (CFGs), which are crucial for the analysis of less structured code.

        Keywords: Program Analysis, Compiler Design, Abstract Syntax Tree (AST), Control Flow Graph (CFG), Data Flow Analysis, LLVM, Security Bugs.

  • https://www.researchgate.net/publication/370980383_A_graph-based_code_representation_method_to_improve_code_readability_classification
    • A graph-based code representation method to improve code readability classification (2023)

    • Context: Code readability is crucial for developers since it is closely related to code maintenance and affects developers’ work efficiency. Code readability classification refers to source code being classified into certain pre-defined levels according to its readability. So far, many code readability classification models have been proposed in existing studies, including deep learning networks that have achieved relatively high accuracy and good performance. Objective: However, in terms of representation, these methods lack effective preservation of the syntactic and semantic structure of the source code. To extract these features, we propose a graph-based code representation method. Method: Firstly, the source code is parsed into a graph containing its abstract syntax tree (AST) combined with control and data flow edges to preserve the semantic structural information, and then we convert the graph nodes’ source code and type information into vectors. Finally, we train our graph neural network model composed of Graph Convolutional Network (GCN), DMoNPooling, and K-dimensional Graph Neural Networks (k-GNNs) layers to extract these features from the program graph. Result: We evaluate our approach on the task of code readability classification using a Java dataset provided by Scalabrino et al. (2016). The results show that our method achieves 72.5% and 88% in three-class and two-class classification accuracy, respectively. Conclusion: We are the first to introduce graph-based representation into code readability classification. Our method outperforms state-of-the-art readability models, which suggests that the graph-based code representation method is effective in extracting syntactic and semantic information from source code, and ultimately improves code readability classification.

Edit: Started a new gist to keep my notes/references together in one place, and added the above linkdump to it:

--

Another idea I've had, but only lightly explored so far, is looking into how various projects like Terser, Webpack, etc. choose their minified variable names in general; but also how they handle 'stable minified variables' between builds (something that I know at least Webpack has some concept of). My thought is that by understanding how they implement 'stable minified variables between builds', we might be able to either a) do something similar, or b) reverse engineer that in a way that could be 'retroactively applied' on top of an existing minified project that didn't use 'stable minified variables', in order to 'stabilise' them.

@pionxzh
Owner

pionxzh commented Dec 7, 2023

Let me share a bit of my current thoughts on this:

  1. Introducing module graph: Like Webpack and other bundlers, a module graph can help us unminify/rename identifiers and exports from bottom to top.
  2. Based on 1, the steps are going to be [unpacked] -> [???] -> [unminify]. This new step will build the module graph, do module scanning, rename the files smartly, and provide this information to unminify.
  3. In the module graph, we can have a map for all exported names and top-level variables/functions, which also allows the user to guide the tool to improve the mapping.
  4. The module graph also brings the possibility of cross-module renaming. For example, un-indirect-call should detect certain patterns and rename the minified export name back to the real name.
  5. I like the idea of "AST fingerprinting". This can also be used in module scanning to replace the current regex implementation.
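As a rough sketch of what points 1–4 could look like — all shapes below are assumptions for illustration, not a committed design for wakaru — a module graph that records each module's exports and imports lets a rename applied at the export's definition site be propagated to every importer:

```javascript
// Hypothetical module graph: tracks exports/imports per module so that
// un-mangling an export name can be propagated across module boundaries.
class ModuleGraph {
  constructor() {
    this.modules = new Map();
  }
  addModule(id, { exports = {}, imports = [] } = {}) {
    this.modules.set(id, { exports, imports });
  }
  // Rename a minified export back to a real name, updating every module
  // that imports it from this module.
  renameExport(moduleId, oldName, newName) {
    const mod = this.modules.get(moduleId);
    mod.exports[newName] = mod.exports[oldName];
    delete mod.exports[oldName];
    for (const [, other] of this.modules) {
      for (const imp of other.imports) {
        if (imp.from === moduleId && imp.name === oldName) imp.name = newName;
      }
    }
  }
}
```

The same exported-name map is also the natural place for user-supplied overrides (point 3), since a manual rename is just a `renameExport` call made by the user instead of a heuristic.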

It's ok to not link this response everywhere as I'm still thinking about this. And it should be moved to a new issue.

@0xdevalias

0xdevalias commented Dec 12, 2023

And it should be moved to a new issue.

@pionxzh I'll create one for it now; as I just wrote out a bunch of thoughts for it that probably shouldn't clutter up this issue more.

Edit: Created:

And I also created a new issue for the 'AST fingerprinting' concept as well:

@0xdevalias

0xdevalias commented Feb 29, 2024

copilot now has a similar feature: https://code.visualstudio.com/updates/v1_87#_rename-suggestions
worth looking into how they've done it

Originally posted by @j4k0xb in jehna/humanify#8 (comment)


Release detailed here:

Couldn't see any overly relevant commits in that range, but did find the following by searching the issues manually:

Which lead me to this label:

And these issues, which sound like there are 'rename providers' used by the feature:

More docs about rename providers here:

Based on the above, and the release notes explicitly mentioning copilot, I suspect the implementation will be in the Copilot extension itself (which isn't open-source):

⇒ file GitHub.copilot-1.168.741.vsix

GitHub.copilot-1.168.741.vsix: Zip archive data, at least v2.0 to extract, compression method=deflate

Though unzipping that and searching for provideRename didn't seem to turn up anything useful unfortunately.

Originally posted by @0xdevalias in jehna/humanify#8 (comment)

@0xdevalias

0xdevalias commented Mar 12, 2024

Continued context from above, it seems that this is implemented via a VSCode proposed API NewSymbolNamesProvider:


It's less about "reverse engineering GitHub copilot" and more about "trying to figure out where the 'rename suggestions' change mentioned in the VSCode release notes was actually implemented; and what mechanism 'integrates' it into VSCode'".

The above is assumptions + an attempt to figure that out; but if you're able to point me to the actual issue/commit on the VSCode side (assuming it was implemented there), or confirm whether it's implemented on the closed source GitHub Copilot extension side of things (if it was implemented there), that would be really helpful.

If it was implemented on the GitHub Copilot extension side of things, then confirming whether the VSCode extension 'rename provider' is the right part of the VSCode extension API to look at to implement a similar feature would be awesome.

Originally posted by @0xdevalias in jehna/humanify#8 (comment)


Thank you for taking interest in this API. The rename suggestions feature is powered by a proposed API defined here. Extensions provide the suggestions, while VS Code shows them in the rename widget.

Originally posted by @ulugbekna in jehna/humanify#8 (comment)

@0xdevalias

I think I shared some of these links in chat already, but adding them here for future reference as well:

Stack Graphs (an evolution of Scope Graphs) sound like they could be really interesting/useful with regards to code navigation, symbol mapping, etc. Perhaps we could use them for module identification, or variable/function identifier naming stabilisation or similar?

  • https://github.blog/changelog/2024-03-14-precise-code-navigation-for-typescript-projects/
    • Precise code navigation is now available for all TypeScript repositories.
      Precise code navigation gives more accurate results by only considering the set of classes, functions, and imported definitions that are visible at a given point in your code.

      Precise code navigation is powered by the stack graphs framework.
      You can read about how we use stack graphs for code navigation and visit the stack graphs definition for TypeScript to learn more.

      • https://github.blog/2021-12-09-introducing-stack-graphs/
        • Introducing stack graphs

        • Precise code navigation is powered by stack graphs, a new open source framework we’ve created that lets you define the name binding rules for a programming language using a declarative, domain-specific language (DSL). With stack graphs, we can generate code navigation data for a repository without requiring any configuration from the repository owner, and without tapping into a build process or other CI job.

        • LOTS of interesting stuff in this post..
        • As part of developing stack graphs, we’ve added a new graph construction language to Tree-sitter, which lets you construct arbitrary graph structures (including but not limited to stack graphs) from parsed CSTs. You use stanzas to define the gadget of graph nodes and edges that should be created for each occurrence of a Tree-sitter query, and how the newly created nodes and edges should connect to graph content that you’ve already created elsewhere.

        • Why aren’t we using the Language Server Protocol (LSP) or Language Server Index Format (LSIF)?

          To dig even deeper and learn more, I encourage you to check out my Strange Loop talk and the stack-graphs crate: our open source Rust implementation of these ideas.

  • https://docs.github.com/en/repositories/working-with-files/using-files/navigating-code-on-github
    • GitHub has developed two code navigation approaches based on the open source tree-sitter and stack-graphs library:

      • Search-based - searches all definitions and references across a repository to find entities with a given name
      • Precise - resolves definitions and references based on the set of classes, functions, and imported definitions at a given point in your code

      To learn more about these approaches, see "Precise and search-based navigation."

      • https://docs.github.com/en/repositories/working-with-files/using-files/navigating-code-on-github#precise-and-search-based-navigation
        • Precise and search-based navigation
          Certain languages supported by GitHub have access to precise code navigation, which uses an algorithm (based on the open source stack-graphs library) that resolves definitions and references based on the set of classes, functions, and imported definitions that are visible at any given point in your code. Other languages use search-based code navigation, which searches all definitions and references across a repository to find entities with a given name. Both strategies are effective at finding results and both make sure to avoid inappropriate results such as comments, but precise code navigation can give more accurate results, especially when a repository contains multiple methods or functions with the same name.

  • https://pl.ewi.tudelft.nl/research/projects/scope-graphs/
    • Scope Graphs | A Theory of Name Resolution

    • Scope graphs provide a new approach to defining the name binding rules of programming languages. A scope graph represents the name binding facts of a program using the basic concepts of declarations and reference associated with scopes that are connected by edges. Name resolution is defined by searching for paths from references to declarations in a scope graph. Scope graph diagrams provide an illuminating visual notation for explaining the bindings in programs.

Originally posted by @0xdevalias in 0xdevalias/chatgpt-source-watch#11
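To make the scope-graph model quoted above concrete, here is a toy version — a real stack graph handles imports, fields, and many edge kinds, so modelling only lexical parent edges is my simplification. Declarations hang off scopes, and resolving a reference is a path search from the reference's scope outward to a matching declaration:

```javascript
// Toy scope graph: scopes connected by parent edges, declarations attached
// to scopes, and name resolution as a path search from reference to
// declaration (a drastic simplification of stack graphs).
class ScopeGraph {
  constructor() {
    this.scopes = new Map();
  }
  addScope(id, parent = null) {
    this.scopes.set(id, { parent, decls: new Map() });
  }
  declare(scopeId, name, info) {
    this.scopes.get(scopeId).decls.set(name, info);
  }
  // Walk parent edges from the reference's scope outward; the first scope
  // that declares the name wins, which gives lexical shadowing for free.
  resolve(scopeId, name) {
    for (let s = this.scopes.get(scopeId); s; s = this.scopes.get(s.parent)) {
      if (s.decls.has(name)) return s.decls.get(name);
    }
    return null;
  }
}
```

For the un-mangling use case, the appeal is that once references are bound to declarations this way, renaming the declaration's `info` consistently renames every reference that resolves to it.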
