# 1. CODE

1. What is the code?
2. AST
3. DFG
4. CFG
5. Code metrics
6. Code analysis
7. Exercise
8. References

# 1. What is the code?

#### IEEE24765 Systems and software engineering Vocabulary

- *Software* --- all or part of the programs, procedures, rules, and associated documentation of an information processing system.
- *Computer program* --- a combination of computer instructions and data definitions that enable computer hardware to perform computational or control functions.
- *Programming language* --- a language used to express computer programs.
- *Source code* --- computer instructions and data definitions expressed in a form suitable for input to an assembler, compiler, or other translator.

#### Is the code text?

The naturalness of software (e.g., [Hindle et al., "On the Naturalness of Software", 2016](https://dl.acm.org/doi/abs/10.1145/2902362)):
>
> *most software is also natural*, in the sense that it is created by humans at work, with all the attendant constraints and limitations---and thus, like natural language, it is also likely to be repetitive and predictable

- code can be usefully modeled by statistical language models
- such models can be leveraged to support software engineers
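
As a rough illustration of the first point, the sketch below fits a bigram model over Python tokens and uses it to guess the next token. The helper names (`code_tokens`, `bigram_model`, `predict_next`) and the tiny corpus are assumptions made for this example; the models in the cited paper are n-gram models trained on far larger corpora.

```python
import io
import tokenize
from collections import Counter, defaultdict

def code_tokens(source):
    """Return the lexical tokens of Python source, skipping layout-only tokens."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT, tokenize.DEDENT,
            tokenize.COMMENT, tokenize.ENDMARKER}
    readline = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type not in skip]

def bigram_model(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(model, token):
    """Most frequent follower of `token`, or None if the token was never seen."""
    followers = model.get(token)
    return followers.most_common(1)[0][0] if followers else None

# Toy corpus (hypothetical); real models are trained on whole repositories.
corpus = '''\
for i in range(10):
    print(i)
for j in range(20):
    print(j)
for k in range(5):
    total = total + k
'''
model = bigram_model(code_tokens(corpus))
print(predict_next(model, 'range'))  # (
print(predict_next(model, 'in'))     # range
```

Even this toy model captures the repetitiveness of code: after `in` the corpus always continues with `range`, which is the kind of regularity a code-completion tool can exploit.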

# 2. Abstract Syntax Tree (AST)

An *abstract syntax tree* is a finite, labeled, directed tree whose internal vertices correspond to programming-language operators and whose leaves correspond to operands (variables and constants).

#### Example

Euclid's algorithm:

```
while b ≠ 0
    if a > b
        a := a − b
    else
        b := b − a
return a
 ```

AST:

![AST](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c7/Abstract_syntax_tree_for_Euclidean_algorithm.svg/425px-Abstract_syntax_tree_for_Euclidean_algorithm.svg.png)

#### ast package

https://docs.python.org/3/library/ast.html

![ast](./res/01_ast.png)

In [1]:
import ast
import astpretty
from pprint import pprint

source = '''\
def f(a, b):
    while b != 0:
        if a > b:
            a = a - b
        else:
            b = b - a
    return a'''
tree = ast.parse(source)

astpretty.pprint(tree, show_offsets=False)

Module(
    body=[
        FunctionDef(
            name='f',
            args=arguments(
                posonlyargs=[],
                args=[
                    arg(arg='a', annotation=None, type_comment=None),
                    arg(arg='b', annotation=None, type_comment=None),
                ],
                vararg=None,
                kwonlyargs=[],
                kw_defaults=[],
                kwarg=None,
                defaults=[],
            ),
            body=[
                While(
                    test=Compare(
                        left=Name(id='b', ctx=Load()),
                        ops=[NotEq()],
                        comparators=[Constant(value=0, kind=None)],
                    ),
                    body=[
                        If(
                            test=Compare(
                                left=Name(id='a', ctx=Load()),
                                ops=[Gt()],
                                comparators=[Name(id='b', ctx=Load(
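
Besides printing the tree, the standard `ast` module lets you traverse it. The sketch below is one possible way to collect simple statistics with `ast.NodeVisitor`; the class name `NodeStats` and the variable `euclid_source` are introduced only for this example.

```python
import ast
from collections import Counter

euclid_source = '''\
def f(a, b):
    while b != 0:
        if a > b:
            a = a - b
        else:
            b = b - a
    return a'''

class NodeStats(ast.NodeVisitor):
    """Count node types and collect the variable names used in the code."""

    def __init__(self):
        self.node_types = Counter()
        self.names = set()

    def generic_visit(self, node):
        # Called for every node without a dedicated visit_* method.
        self.node_types[type(node).__name__] += 1
        super().generic_visit(node)

    def visit_Name(self, node):
        # Name nodes are the leaves that refer to variables.
        self.names.add(node.id)
        self.generic_visit(node)

stats = NodeStats()
stats.visit(ast.parse(euclid_source))
print(sorted(stats.names))                                      # ['a', 'b']
print(stats.node_types['Assign'], stats.node_types['Compare'])  # 2 2
```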

#### tree-sitter

- https://tree-sitter.github.io/tree-sitter
- https://github.com/tree-sitter/py-tree-sitter

![tree-sitter](./res/01_tree-sitter.png)

In [2]:
from pathlib import Path
from tree_sitter import Language, Parser

tree_sitter_lib_path = Path('/mnt/se_data/code/tree_sitter/tree_sitter_ccpcsgohajajsphpysc.so')
assert tree_sitter_lib_path.is_file()

PY_LANGUAGE = Language(tree_sitter_lib_path, 'python')
parser = Parser()
parser.set_language(PY_LANGUAGE)

In [3]:
tree = parser.parse(bytes(source, 'utf8'))
r = tree.root_node

In [4]:
r.sexp()

'(module (function_definition name: (identifier) parameters: (parameters (identifier) (identifier)) body: (block (while_statement condition: (comparison_operator (identifier) (integer)) body: (block (if_statement condition: (comparison_operator (identifier) (identifier)) consequence: (block (expression_statement (assignment left: (identifier) right: (binary_operator left: (identifier) right: (identifier))))) alternative: (else_clause body: (block (expression_statement (assignment left: (identifier) right: (binary_operator left: (identifier) right: (identifier))))))))) (return_statement (identifier)))))'

In [5]:
dir(r)

['__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'byte_range',
 'child',
 'child_by_field_id',
 'child_by_field_name',
 'child_count',
 'children',
 'children_by_field_id',
 'children_by_field_name',
 'descendant_count',
 'descendant_for_byte_range',
 'descendant_for_point_range',
 'edit',
 'end_byte',
 'end_point',
 'field_name_for_child',
 'grammar_id',
 'grammar_name',
 'has_changes',
 'has_error',
 'id',
 'is_error',
 'is_extra',
 'is_missing',
 'is_named',
 'kind_id',
 'named_child',
 'named_child_count',
 'named_children',
 'named_descendant_for_byte_range',
 'named_descendant_for_point_range',
 'next_named_sibling',
 'next_parse_state',
 'next_sibling',
 'parent',
 'parse_state',
 'prev_name

In [6]:
text = r.text.decode('UTF-8')
print(text)

assert text == source

def f(a, b):
    while b != 0:
        if a > b:
            a = a - b
        else:
            b = b - a
    return a


In [7]:
r.children

[<Node type=function_definition, start_point=(0, 0), end_point=(6, 12)>]

In [8]:
r.children[0].children

[<Node type="def", start_point=(0, 0), end_point=(0, 3)>,
 <Node type=identifier, start_point=(0, 4), end_point=(0, 5)>,
 <Node type=parameters, start_point=(0, 5), end_point=(0, 11)>,
 <Node type=":", start_point=(0, 11), end_point=(0, 12)>,
 <Node type=block, start_point=(1, 4), end_point=(6, 12)>]

In [9]:
print(r.children[0].children[4].children[0].text.decode('UTF-8'))

while b != 0:
        if a > b:
            a = a - b
        else:
            b = b - a


In [10]:
# Pattern-matching: https://github.com/tree-sitter/py-tree-sitter#pattern-matching

query = PY_LANGUAGE.query(
    """
(function_definition
  name: (identifier) @function.def)

(call
  function: (identifier) @function.call)
"""
)

captures = query.captures(r)
print(captures)

[(<Node type=identifier, start_point=(0, 4), end_point=(0, 5)>, 'function.def')]


# 3. Data Flow Graph (DFG)

A *data flow graph* is a graph that represents the data dependencies between a number of operations.

#### Example

Finding the roots of a quadratic equation (http://bears.ece.ucsb.edu/research-info/DP/dfg.html):
```
quad( a, b, c)
t1 = a*c;
t2 = 4*t1;
t3 = b*b;
t4 = t3 - t2;
t5 = sqrt( t4);
t6 = -b;
t7 = t6 - t5;
t8 = t6 + t5;
t9 = 2*a;
r1 = t7/t9;
r2 = t8/t9;
```

DFG:

![DFG](http://bears.ece.ucsb.edu/research-info/DP/dfg.figure.id.1.gif)
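
The edges of such a graph can be recovered mechanically from straight-line code. The following is a minimal sketch using the standard `ast` module on a Python transcription of the example; `quad_body` and `data_flow_edges` are names made up for this illustration, and the code assumes that every statement is a simple `name = expression` assignment. Here `t8` is written as `t6 + t5`, the form that yields the second root.

```python
import ast

quad_body = '''\
t1 = a * c
t2 = 4 * t1
t3 = b * b
t4 = t3 - t2
t5 = sqrt(t4)
t6 = -b
t7 = t6 - t5
t8 = t6 + t5
t9 = 2 * a
r1 = t7 / t9
r2 = t8 / t9
'''

def data_flow_edges(source):
    """Return (producer, consumer) pairs for straight-line assignments."""
    edges = []
    for stmt in ast.parse(source).body:
        # Assumption: each statement has the form `name = expression`.
        target = stmt.targets[0].id
        nodes = list(ast.walk(stmt.value))
        # Names in call position (such as sqrt) are operators, not data.
        callees = {id(n.func) for n in nodes if isinstance(n, ast.Call)}
        used = {n.id for n in nodes
                if isinstance(n, ast.Name) and id(n) not in callees}
        edges.extend((name, target) for name in sorted(used))
    return edges

edges = data_flow_edges(quad_body)
print(len(edges))  # 17
print(edges[:3])   # [('a', 't1'), ('c', 't1'), ('t1', 't2')]
```

Each pair `(x, y)` says that the value `x` flows into the operation that produces `y`.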

# 4. Control Flow Graph (CFG)

A *control flow graph* is a directed graph that represents the paths that might be traversed through a program during its execution: the nodes are basic blocks, and the edges are possible transfers of control between them.

#### Example

https://en.wikipedia.org/wiki/Control-flow_graph

![](./res/01_cfg_ab.png)

- (a) an if-then-else
- (b) a while loop

![](./res/01_cfg_cd.png)

- (c) a natural loop with two exits, e.g. while with an if...break in the middle
- (d) a loop with two entry points, e.g. goto into a while or for loop
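
Shapes (a) and (b) can be written down as small adjacency dictionaries. The sketch below uses hypothetical block names and a depth-first search to find back edges, that is, edges pointing to a block that is still being explored; a back edge is what distinguishes a loop from a plain branch.

```python
# Hypothetical block names; each CFG maps a block to its successor blocks.
if_then_else = {'cond': ['then', 'else'], 'then': ['join'],
                'else': ['join'], 'join': []}
while_loop = {'cond': ['body', 'exit'], 'body': ['cond'], 'exit': []}

def back_edges(cfg, entry):
    """Return the edges that lead back to a block on the current DFS path."""
    found, on_path, done = [], set(), set()

    def visit(block):
        on_path.add(block)
        for succ in cfg[block]:
            if succ in on_path:
                found.append((block, succ))  # jumps back: a loop edge
            elif succ not in done:
                visit(succ)
        on_path.discard(block)
        done.add(block)

    visit(entry)
    return found

print(back_edges(if_then_else, 'cond'))  # []
print(back_edges(while_loop, 'cond'))    # [('body', 'cond')]
```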

#### python-graphs

- https://github.com/google-research/python-graphs
- https://pypi.org/project/python-graphs/
- https://github.com/google-research/python-graphs/blob/main/examples/control_flow_example.py

In [11]:
!apt install -y graphviz libgraphviz-dev
!pip install pygraphviz python-graphs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
graphviz is already the newest version (2.42.2-6).
libgraphviz-dev is already the newest version (2.42.2-6).
0 upgraded, 0 newly installed, 0 to remove and 96 not upgraded.


In [12]:
def bubbleSort(arr):
    n = len(arr)
    swapped = False
    for i in range(n-1):
        for j in range(0, n-i-1):
            if arr[j] > arr[j + 1]:
                swapped = True
                arr[j], arr[j + 1] = arr[j + 1], arr[j]        
        if not swapped:
            return

fn = bubbleSort

In [13]:
from python_graphs import control_flow

graph = control_flow.get_control_flow_graph(fn)

In [14]:
dir(graph)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'add_node',
 'blocks',
 'compact',
 'get_block_by_ast_node',
 'get_block_by_ast_node_and_label',
 'get_block_by_ast_node_type_and_label',
 'get_block_by_function_name',
 'get_block_by_source',
 'get_block_by_source_and_ast_node_type',
 'get_blocks_by_ast_node',
 'get_blocks_by_ast_node_type_and_label',
 'get_blocks_by_function_name',
 'get_blocks_by_source',
 'get_blocks_by_source_and_ast_node_type',
 'get_control_flow_node_by_ast_node',
 'get_control_flow_node_by_source',
 'get_control_flow_node_by_source_and_identifier',
 'get_control_flow_nodes',
 'get_control_flow_nodes_by_ast_node',
 'get_control_flow_nodes_by_source

In [15]:
from python_graphs import control_flow_graphviz
from python_graphs import program_utils

source = program_utils.getsource(fn)
control_flow_graphviz.render(graph, include_src=source, path='cfg_bubble_sort.png')

![](./res/01_cfg_bubble_sort.png)

#### Program graph

- https://github.com/google-research/python-graphs/blob/main/examples/program_graph_example.py

```
from python_graphs import program_graph
from python_graphs import program_graph_graphviz

graph = program_graph.get_program_graph(fn)
program_graph_graphviz.render(graph, path='pg.png')
```

# 5. Code metrics

- *You can't control what you can't measure.*
- *Good metrics don't mean good quality.*

#### [SLOC (LOC)](https://en.wikipedia.org/wiki/Source_lines_of_code)

Source lines of code or lines of code.

*Physical SLOC (LOC)* --- a count of lines in the text of the program's source code excluding comment lines

*Logical SLOC (LLOC)* --- a number of executable "statements"

```
for (i = 0; i < 100; i++) printf("hello"); /* How many lines of code is this? */
```

- 1 physical line of code (LOC)
- 2 logical lines of code (LLOC) (for statement and printf statement)
- 1 comment line

```
/* Now how many lines of code is this? */
for (i = 0; i < 100; i++)
{
    printf("hello");
}
```

- 4 physical lines of code (LOC): is placing braces work to be estimated?
- 2 logical lines of code (LLOC): what about all the work writing non-statement lines?
- 1 comment line: tools must account for all code and comments regardless of comment placement
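
For Python sources, both counts can be approximated in a few lines. This is only a sketch under simplifying assumptions: physical SLOC is taken as the number of lines that are neither blank nor pure `#` comments, and logical SLOC as the number of statement nodes in the AST. The function names are made up for this example, and a line inside a multi-line string that starts with `#` would be miscounted.

```python
import ast

def physical_sloc(source):
    """Count lines that are neither blank nor pure comment lines."""
    return sum(1 for line in source.splitlines()
               if line.strip() and not line.strip().startswith('#'))

def logical_sloc(source):
    """Count statement nodes in the abstract syntax tree."""
    return sum(isinstance(node, ast.stmt)
               for node in ast.walk(ast.parse(source)))

one_line = "for i in range(100): print('hello')  # How many lines is this?\n"
multi_line = """\
# Now how many lines of code is this?
for i in range(100):
    print('hello')
"""
print(physical_sloc(one_line), logical_sloc(one_line))      # 1 2
print(physical_sloc(multi_line), logical_sloc(multi_line))  # 2 2
```

The logical count is the same for both layouts, while the physical count depends on formatting.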

#### Cyclomatic complexity

- [Cyclomatic complexity](https://en.wikipedia.org/wiki/Cyclomatic_complexity)
- https://github.com/google-research/python-graphs/blob/main/examples/cyclomatic_complexity_example.py

*Cyclomatic complexity* is a software metric used to indicate the complexity of a program. It is a quantitative measure of the number of linearly independent paths through a program's source code.

Equivalently, for the control flow graph (CFG) of the program, it is computed as

$$M = E - N + 2P,$$

where
- E = the number of edges of the CFG.
- N = the number of nodes of the CFG.
- P = the number of connected components.
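
The formula can be checked by hand on tiny graphs. The sketch below applies it to hand-built CFGs stored as adjacency dictionaries; the block names and the helper `cyclomatic` are assumptions made for this example, and each graph is taken to be a single connected component ($P = 1$).

```python
def cyclomatic(cfg, components=1):
    """Compute E - N + 2P for a CFG given as {block: [successor blocks]}."""
    nodes = len(cfg)
    edges = sum(len(successors) for successors in cfg.values())
    return edges - nodes + 2 * components

straight_line = {'a': ['b'], 'b': []}
if_then_else = {'cond': ['then', 'else'], 'then': ['join'],
                'else': ['join'], 'join': []}
while_loop = {'cond': ['body', 'exit'], 'body': ['cond'], 'exit': []}

print(cyclomatic(straight_line))  # 1 - 2 + 2 = 1
print(cyclomatic(if_then_else))   # 4 - 4 + 2 = 2
print(cyclomatic(while_loop))     # 3 - 3 + 2 = 2
```

Each decision point (a branch or a loop condition) adds one to the value, which is why straight-line code scores 1.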

In [16]:
from python_graphs import control_flow
from python_graphs import cyclomatic_complexity

graph = control_flow.get_control_flow_graph(fn)
value = cyclomatic_complexity.cyclomatic_complexity(graph)
print(value)

5


# 6. Code analysis

#### Static and dynamic code analysis

*Static analysis* is the analysis of computer programs performed without executing them.

*Dynamic analysis* is the analysis of computer software that involves executing the program in question.
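
The difference can be seen on a four-line function. In the sketch below, the static part lists statement lines from the AST without running anything, and the dynamic part runs the function under `sys.settrace` and records which lines were actually executed. The function `safe_div`, the helper names, and the pseudo-filename `<demo>` are all assumptions made for this example.

```python
import ast
import sys

demo_source = '''\
def safe_div(a, b):
    if b == 0:
        return None
    return a / b
'''

def static_statement_lines(source):
    """Static: line numbers of statements in function bodies, nothing is run."""
    tree = ast.parse(source)
    return {node.lineno for node in ast.walk(tree)
            if isinstance(node, ast.stmt)
            and not isinstance(node, ast.FunctionDef)}

def dynamic_executed_lines(source, func_name, *args):
    """Dynamic: run the function and record the lines that were executed."""
    namespace = {}
    exec(compile(source, '<demo>', 'exec'), namespace)
    executed = set()

    def tracer(frame, event, arg):
        if frame.f_code.co_filename != '<demo>':
            return None  # ignore code that is not part of the demo
        if event == 'line':
            executed.add(frame.f_lineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = namespace[func_name](*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, executed

all_lines = static_statement_lines(demo_source)
result, executed = dynamic_executed_lines(demo_source, 'safe_div', 6, 3)
print(sorted(all_lines))             # [2, 3, 4]
print(result, sorted(executed))      # 2.0 [2, 4]
print(sorted(all_lines - executed))  # [3]
```

Static analysis sees every statement but cannot tell which ones a given input reaches; dynamic analysis is exact for the inputs it ran and says nothing about line 3 until some test passes `b == 0`.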


#### The Halting Problem

> Given a program $P$ and an input $I$, will $P$ halt on $I$?

Assume that there exists an algorithm $H(P,I)$ that solves the halting problem for every program $P$ and input $I$: it outputs True if $P$ halts on $I$ and False otherwise.

Let $X(P)$ be the following algorithm:
1. makes two copies of $P$
2. runs $H(P,P)$
3. if $H(P,P)$ outputs True, $X$ goes into an infinite loop and runs forever
4. if $H(P,P)$ outputs False, $X$ terminates

Let's compare $X(X)$ and $H(X, X)$:
- If $H(X, X)$ returns True (that is, $H$ claims that $X$ halts on $X$), then $X$ goes into an infinite loop and doesn't halt on $X$. Contradiction.
- If $H(X, X)$ returns False (that is, $H$ claims that $X$ doesn't halt on $X$), then $X$ terminates and halts on $X$. Contradiction.

Thus, $H$ doesn't exist.
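
The construction of $X$ can be mirrored in Python for any concrete candidate decider. The sketch below is an illustration, not a proof: `make_x` and `always_no` are hypothetical names, and only the candidate that answers False is actually run, because the opposite candidate would send `x(x)` into a real infinite loop.

```python
def make_x(h):
    """Build the program X from a candidate halting decider h(program, data)."""
    def x(p):
        if h(p, p):        # h claims that p halts on p ...
            while True:    # ... so X deliberately runs forever
                pass
        return 'halted'    # h claims that p loops, so X halts at once
    return x

def always_no(program, data):
    """A candidate decider that claims that nothing ever halts."""
    return False

x = make_x(always_no)
print(always_no(x, x))  # False: the candidate predicts that X(X) never halts
print(x(x))             # halted: X(X) does halt, so the prediction is wrong
```

Whatever total function is passed to `make_x`, the resulting `x` does the opposite of what that function predicts for `x(x)`, which is exactly the contradiction used in the argument above.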

Rice's theorem:
> Any nontrivial property about the language recognized by a Turing machine is undecidable.

# 7. Exercise

Given a fragment of Python code $S$.
Create a code fragment $S'$ that differs from $S$ only in the following ways:
- all comments are removed
- tabs are replaced with spaces, and insignificant whitespace is removed
- the names of variables, functions, and classes are replaced with new ones: "name_<number>".

# 8. References

- [IEEE24765-2017 - Systems and software engineering Vocabulary](https://www.cse.msu.edu/~cse435/Handouts/Standards/IEEE24765.pdf)
- [Compilers: Principles, Techniques, and Tools](https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools)
- [Abstract syntax tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree)
- [Tree-sitter basics](https://eightify.app/summary/climate-change/tree-sitter-guide-mastering-the-basics)
- [Tree-sitter: traverse a large number of nodes](https://github.com/tree-sitter/py-tree-sitter?tab=readme-ov-file#walking-syntax-trees)
- [Tree-sitter: edit the syntax tree](https://github.com/tree-sitter/py-tree-sitter?tab=readme-ov-file#editing)
- [Data flow graph](http://bears.ece.ucsb.edu/research-info/DP/dfg.html)
- [Data flow graphs](https://www.youtube.com/watch?v=PmAK_5MrUMg)
- [CodeQL: Analyzing data flow in Python](https://codeql.github.com/docs/codeql-language-guides/analyzing-data-flow-in-python/)
- [Program dependence graph](https://piazza.com/class_profile/get_resource/hy7enxf648g7me/i2qodoub2q73x)
- [Code property graph](https://en.wikipedia.org/wiki/Code_property_graph)
- [Halting problem](https://cs.uwaterloo.ca/~a23gao/cs245_f17/notes/undecidability_solutions.pdf)
- [Rice's theorem](http://kilby.stanford.edu/~rvg/154/handouts/Rice.html)