About the construction of DFG #88

Closed
wangdeze18 opened this issue Nov 29, 2021 · 10 comments

@wangdeze18

Thank you for your great work! The code is very clear and concise to read.

I would like to ask about the logic behind each function in DFG.py. I would like to implement CFG (control flow graph) construction using it as a reference, because I think a CFG can sometimes be useful for understanding the code as well.

@guoday
Contributor

guoday commented Nov 30, 2021

We first keep a state table for all variables that records the position of each variable's last assignment. Then we enumerate each variable in the AST and decide whether its value changes. If the value changes (e.g. "a" in a = b + 1), we update the position of "a" in the state table and record the value flow of "a" (i.e. "a" is computed from "b" + 1). If the value doesn't change (e.g. "b" in a = b + 1), we only need to record the value flow of "b" (i.e. "b" comes from the position of "b" in the state table).
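The state-table idea described above can be sketched roughly as follows. This is not the repository's DFG.py: it is a minimal illustration that assumes Python's standard `ast` module instead of the parser used in the project, handles only top-level assignments, and uses (line, column) positions. The function name `build_dfg` and the edge labels `"comesFrom"` / `"computedFrom"` are chosen here for illustration.

```python
import ast


def build_dfg(source):
    """Return edges of the form (name, position, relation, source_positions).

    Hypothetical sketch: only top-level `x = <expr>` statements are handled.
    """
    states = {}  # variable name -> (line, col) of its last assignment
    edges = []
    for stmt in ast.parse(source).body:
        if not isinstance(stmt, ast.Assign):
            continue  # anything else is out of scope for this sketch

        # Operands on the right-hand side, in source order.
        operands = [
            node for node in ast.walk(stmt.value)
            if isinstance(node, (ast.Name, ast.Constant))
        ]
        operands.sort(key=lambda node: (node.lineno, node.col_offset))

        # Variables that are only read: their value does not change, so each
        # one "comes from" the last assignment stored in the state table.
        for node in operands:
            if isinstance(node, ast.Name):
                pos = (node.lineno, node.col_offset)
                src = [states[node.id]] if node.id in states else []
                edges.append((node.id, pos, "comesFrom", src))

        # Assigned variables: their value changes, so record what they are
        # computed from and update the state table.
        operand_positions = [(n.lineno, n.col_offset) for n in operands]
        for target in stmt.targets:
            if isinstance(target, ast.Name):
                pos = (target.lineno, target.col_offset)
                edges.append((target.id, pos, "computedFrom", operand_positions))
                states[target.id] = pos
    return edges


if __name__ == "__main__":
    for edge in build_dfg("b = 1\na = b + 1\n"):
        print(edge)
```

In this sketch every assignment gets its own `computedFrom` edge, and a later read of the variable links only to the most recent assignment in `states`.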

@wangdeze18
Author

Thanks for your reply! This seems to produce errors in extreme cases, for example with statements whose result is never used:

a = b + 1
a = c * 2

@guoday
Contributor

guoday commented Nov 30, 2021

For this example, I don't see any problem with the approach described in my reply.
The first "a" (0,0) will come from "b"(0,2) and "1"(0,4).
The second "a" (1,0) will come from "c"(1,2) and "2"(1,4).

@wangdeze18
Author

Thanks for the quick reply. The first statement is overwritten by the second statement and is therefore invalid, so the two data-dependent edges introduced according to the first statement are also meaningless. Of course, this is a relatively rare case (but extracting the graph features as accurately as possible is of great importance for the subsequent processing).

@guoday
Contributor

guoday commented Nov 30, 2021

The two data-dependent edges introduced by the first statement are very important. This is also one of our motivations for leveraging data flow: it can help to find dead code. As shown in Figure 1 of the paper, we can easily tell from the data flow that x=0 is dead code, which can help the model ignore that statement.

@wangdeze18
Author

In Figure 2, variable-alignment (dfg-to-code) still considers x = 0. And for data flow edge prediction (dfg-to-dfg), edge 7 and edge 9 will also take their association with edge 3 (x = 0) into account. Would it be a better choice not to consider edge 3 (x = 0) at all?

@guoday
Contributor

guoday commented Nov 30, 2021

No, I don't think so. Considering x=0 in the data flow can help the model know that x=0 doesn't contribute to return x, since there is no path between x^3 and x^11 in the data flow. Therefore, the model can learn that x=0 is dead code and ignore it. The model will then be less affected by dead code and more robust.
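The "no path between x^3 and x^11" argument amounts to a reachability check on the data flow graph. Below is a minimal sketch of that check. The graph is hypothetical: the node names `x3`, `x7`, `x9`, `x11` follow the indices mentioned in this thread, but the edges are assumed for illustration and are not copied from the paper's figure. The helper name `reaches` is also made up here.

```python
from collections import deque


def reaches(value_sources, start):
    """Return all nodes reachable from `start` by walking value-flow edges backwards."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for src in value_sources.get(node, ()):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen


# Hypothetical data flow: node -> nodes its value comes from.
value_sources = {
    "x3": [],             # x = 0 (assumed: no variable flows into it)
    "x7": ["b"],          # x = b
    "x9": ["a"],          # x = a
    "x11": ["x7", "x9"],  # return x
}

live = reaches(value_sources, "x11")
# Variable occurrences with no path to the returned value.
dead = sorted(set(value_sources) - live)
print(dead)
```

Under these assumed edges, `x3` is the only node that never reaches `x11`, which is the sense in which the thread calls x=0 dead code.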

@wangdeze18
Author

I think perhaps this could be manually screened out during the preprocessing phase to focus on the more important program statements. Additionally, do you have any suggestions for CFG construction?

@guoday
Contributor

guoday commented Nov 30, 2021

"I think perhaps this could be artificially screened out during the preprocessing phase to focus on the more important program statements. "

Yes, you are right and I totally agree. However, filtering out this meaningless code does not seem easy in the preprocessing phase. Therefore, we hope the model can learn this feature in the pre-training phase. Thank you for this great idea.

"Additionally, are there any guidance suggestions for CFG construction?"

Actually, I am an NLP researcher and I don't know much about CFGs, so I don't know whether there are any tools that can do this.

@wangdeze18
Author

Thank you for your time!

guody5 closed this as completed Nov 30, 2021