# 01. CODE

1. Code is written by humans
2. Code is for humans
3. Code is for machines
4. Code and LLMs
5. Code representations
   - AST
   - DFG
   - CFG
6. Code metrics
7. Code analysis
8. Exercise
9. References

# 1. Code is written by humans

#### IEEE24765 SYSTEMS AND SOFTWARE ENGINEERING VOCABULARY

- *Software* --- all or part of the programs, procedures, rules, and associated documentation of an information processing system.
- *Computer program* --- a combination of computer instructions and data definitions that enable computer hardware to perform computational or control functions.
- *Programming language* --- a language used to express computer programs.
- *Source code* --- computer instructions and data definitions expressed in a form suitable for input to an assembler, compiler, or other translator.

Until recently, code was almost always written by people (the exception being autogenerated code).
Programming languages are used to write code.

Programming languages are different from natural languages.
They have more restrictions, more rules.

What are the requirements for code?
1. Code must work according to its purpose.
2. Code must be manageable: easy to maintain and to develop further.

Thus, code is written by people, both for people and for machines.

# 2. Code is for humans

Proof that code is written for humans:
- comments
- style guides (e.g. [Google Style Guides](https://google.github.io/styleguide/)) and linters
- etc.

Why do programmers read code?
- to understand what is happening in the code
- to find and fix bugs
- to optimize the code
- etc.

# 3. Code is for machines

Proof that the code is written for machines:
- use of programming languages
- etc.

Source: [Aho et al - Compilers](https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools)


A _compiler_ is a program that can read a program in one language --- the _source_ language --- and translate it into an equivalent program in another language --- the *target* language.

![](./res/01_compiler.png)

If the target program is an executable machine-language program, it can then be called by the user to process inputs and produce outputs.

![](./res/01_target_program.png)

An _interpreter_ directly executes the operations specified in the source program on inputs supplied by
the user.

![](./res/01_interpreter.png)
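
CPython combines both ideas: source text is first translated into bytecode, and the bytecode is then executed by an interpreter loop. A minimal sketch using the built-in `compile` and `exec`; the file name `"<demo>"` and the variable names are illustrative assumptions, not part of the source material.

```python
# Translation step: turn source text into a code object (CPython bytecode).
source = "result = 6 * 7"
code_object = compile(source, "<demo>", "exec")

# Execution step: the interpreter runs the code object in a fresh namespace.
namespace = {}
exec(code_object, namespace)
print(namespace["result"])
```

The two steps are separate: the same `code_object` can be executed many times without translating the source again.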

A source program may be divided into modules stored in separate files. The task of collecting the source
program is sometimes entrusted to a separate program, called a _preprocessor_. The preprocessor may also expand shorthands, called macros, into source language statements.

The modified source program is then fed to a compiler. The compiler may produce an assembly-language program as its output, because assembly language is easier to produce as output and is easier to debug. The assembly language is then processed by a program called an _assembler_ that produces relocatable machine code as its output.

Large programs are often compiled in pieces, so the relocatable machine code may have to be linked together with other relocatable object files and library files into the code that actually runs on the machine. The _linker_ resolves external memory addresses, where the code in one file may refer to a location in another file. The _loader_ then puts together all of the executable object files into memory for execution.


![](./res/01_a_language-processing_system.png)

#### HOW DOES A COMPILER WORK?

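A compiler works in two parts:

1. *analysis*: breaks the source program into its constituent pieces, imposes a grammatical structure on them, and builds an intermediate representation together with a symbol table
2. *synthesis*: constructs the target program from the intermediate representation and the symbol table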

Phases of a compiler:

![](./res/01_phases_of_a_compiler.png)

Translation of an assignment statement: `position = initial + rate * 60`

![](./res/01_translation.png)

#### LEXICAL ANALYSIS

The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called _lexemes_.

For example, suppose a source program contains the assignment statement

`position = initial + rate * 60`

Lexemes: `position`, `=`, `initial`, `+`, `rate`, `*`, `60`.
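
The same grouping can be reproduced with Python's standard `tokenize` module, because the statement happens to be valid Python as well. This is a sketch of lexical analysis in general, not of the compiler described in the book; the `KEPT` filter is an assumption made here to hide bookkeeping tokens.

```python
import io
import tokenize

statement = "position = initial + rate * 60"

# Keep identifiers, operators, and numbers; tokenize also emits bookkeeping
# tokens (NEWLINE, ENDMARKER) that are not lexemes of the statement itself.
KEPT = {tokenize.NAME, tokenize.OP, tokenize.NUMBER}
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(statement).readline)
    if tok.type in KEPT
]
lexemes = [text for _, text in tokens]
print(lexemes)
```

Each entry of `tokens` pairs a token class (`NAME`, `OP`, `NUMBER`) with the lexeme it was built from.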

#### SYNTAX ANALYSIS

The second phase of the compiler is _syntax analysis_ or _parsing_.

The parser uses the first components of the tokens produced by the lexical analyzer to create a tree-like intermediate representation that depicts the grammatical structure of the token stream. A typical representation is a syntax tree in which each interior node represents an operation and the children of the node represent the arguments of the operation.
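
As an illustration, Python's own parser (the standard `ast` module) can build a syntax tree for the same statement. The sketch below only inspects the tree; it shows that `*` ends up deeper in the tree than `+`, i.e. multiplication binds tighter.

```python
import ast

tree = ast.parse("position = initial + rate * 60")
assign = tree.body[0]      # the single assignment statement
addition = assign.value    # interior node: initial + (rate * 60)
product = addition.right   # interior node: rate * 60

# Interior nodes are operations, their children are the arguments.
print(type(addition.op).__name__, type(product.op).__name__)
print(ast.dump(assign.value))
```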

#### SEMANTIC ANALYSIS

The _semantic analyzer_ uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition.

It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.
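
Type checking is one typical semantic check. The sketch below assumes a hypothetical symbol table in which `position`, `initial`, and `rate` are declared as floating-point, and it records which integer operands would need an int-to-float conversion. The names `SYMBOL_TABLE` and `infer_type` are made up for this example, and `ast.unparse` requires Python 3.9 or newer.

```python
import ast

# Hypothetical symbol table: declared types of the three identifiers.
SYMBOL_TABLE = {"position": "float", "initial": "float", "rate": "float"}

def infer_type(node, coercions):
    """Return the type of an expression node, recording int operands that
    would have to be converted to float."""
    if isinstance(node, ast.Name):
        return SYMBOL_TABLE[node.id]
    if isinstance(node, ast.Constant):
        return "int" if isinstance(node.value, int) else "float"
    if isinstance(node, ast.BinOp):
        left = infer_type(node.left, coercions)
        right = infer_type(node.right, coercions)
        if left == right:
            return left
        # Mixed int/float arithmetic: note which operand needs converting.
        int_operand = node.left if left == "int" else node.right
        coercions.append(ast.unparse(int_operand))
        return "float"
    raise TypeError(f"unsupported node: {type(node).__name__}")

assign = ast.parse("position = initial + rate * 60").body[0]
coercions = []
value_type = infer_type(assign.value, coercions)
target_type = SYMBOL_TABLE[assign.targets[0].id]
print(value_type == target_type, coercions)
```

The expression type matches the declared type of `position`, and the only operand flagged for conversion is the integer literal `60`.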

# 4. Code and LLMs

#### IS THE CODE TEXT?

At first glance, code looks very different from natural language text.
And one might expect that NLP approaches would not work well with code.
Up until a certain point, this was true.
Then came _language models_ --- models that predict the next element of text. These models were good at both NLP and code tasks.

The naturalness of software (e.g., [Hindle et al - On the naturalness of software 2016](https://dl.acm.org/doi/abs/10.1145/2902362)):
>
> *most software is also natural*, in the sense that it is created by humans at work, with all the attendant constraints and limitations---and thus, like natural language, it is also likely to be repetitive and predictable

- code can be usefully modeled by statistical language models
- such models can be leveraged to support software engineers
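
To make "repetitive and predictable" concrete, here is a toy bigram model over code tokens. It is only a sketch: the four-line corpus and the helper names `lex` and `predict` are assumptions made for this example, and real statistical models of code are trained on far larger corpora.

```python
import io
import tokenize
from collections import Counter, defaultdict

# Tiny made-up corpus with the kind of repetition found in real code.
corpus = '''\
for i in range(n):
    total = total + i
for j in range(m):
    count = count + j
'''

def lex(code):
    """Split code into identifier, operator, and number tokens."""
    kept = {tokenize.NAME, tokenize.OP, tokenize.NUMBER}
    stream = tokenize.generate_tokens(io.StringIO(code).readline)
    return [tok.string for tok in stream if tok.type in kept]

# Count, for every token, which tokens follow it.
tokens = lex(corpus)
following = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    following[prev][nxt] += 1

def predict(prev):
    """Most frequent next token after `prev` in the corpus."""
    return following[prev].most_common(1)[0][0]

print(predict("in"), predict("range"), predict(")"))
```

Even this tiny model predicts that `in` is followed by `range` and `range` by `(`, because those patterns repeat in the corpus.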

LLMs are general-purpose models.
They can work with text, code, and other kinds of information.
Sometimes we get a better solution if we send the model not only the code but also useful additional information derived from it, for example a data flow graph.

# 5. Code representations

## 5.1. Abstract Syntax Tree (AST)

An *abstract syntax tree* is a finite, labeled, directed tree in which the internal vertices correspond to programming language operators and the leaves correspond to operands (variables and constants).

#### Example

Euclid's algorithm:

```
while b ≠ 0
    if a > b
        a := a − b
    else
        b := b − a
return a
```

AST:

![AST](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c7/Abstract_syntax_tree_for_Euclidean_algorithm.svg/425px-Abstract_syntax_tree_for_Euclidean_algorithm.svg.png)

#### ast package

https://docs.python.org/3/library/ast.html

![ast](./res/01_ast.png)

In [None]:
import ast
import astpretty  # third-party pretty-printer for ASTs (pip install astpretty)

source = '''\
def f(a, b):
    while b != 0:
        if a > b:
            a = a - b
        else:
            b = b - a
    return a'''
tree = ast.parse(source)

astpretty.pprint(tree, show_offsets=False)
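
The tree can also be traversed programmatically with the standard library alone. The sketch below uses `ast.NodeVisitor` to list the statement nodes in visiting order and the names that are assigned to; the class name `Summary` is an assumption made for this example.

```python
import ast

source = '''\
def f(a, b):
    while b != 0:
        if a > b:
            a = a - b
        else:
            b = b - a
    return a'''

class Summary(ast.NodeVisitor):
    """Collect statement kinds and the names that appear as assignment targets."""

    def __init__(self):
        self.statements = []
        self.assigned = []

    def visit_Assign(self, node):
        # Record plain-name targets, then continue like any other node.
        self.assigned.extend(t.id for t in node.targets if isinstance(t, ast.Name))
        self.generic_visit(node)

    def generic_visit(self, node):
        if isinstance(node, ast.stmt):
            self.statements.append(type(node).__name__)
        super().generic_visit(node)

summary = Summary()
summary.visit(ast.parse(source))
print(summary.statements)
print(summary.assigned)
```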

#### tree-sitter

- https://tree-sitter.github.io/tree-sitter
- https://github.com/tree-sitter/py-tree-sitter

![tree-sitter](./res/01_tree-sitter.png)

In [None]:
from pathlib import Path
from tree_sitter import Language, Parser

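# This cell uses the API of older py-tree-sitter releases, which load grammars
# from a pre-built shared library. It assumes that 'tree_sitter.so' was built
# beforehand and contains the Python grammar.
tree_sitter_lib_path = Path('tree_sitter.so')
assert tree_sitter_lib_path.is_file()

PY_LANGUAGE = Language(tree_sitter_lib_path, 'python')
parser = Parser()
parser.set_language(PY_LANGUAGE)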

In [None]:
tree = parser.parse(bytes(source, 'utf8'))
r = tree.root_node

In [None]:
r.sexp()

In [None]:
dir(r)

In [None]:
text = r.text.decode('UTF-8')
print(text)

assert text == source

In [None]:
r.children

In [None]:
r.children[0].children

In [None]:
print(r.children[0].children[4].children[0].text.decode('UTF-8'))

In [None]:
# Pattern-matching: https://github.com/tree-sitter/py-tree-sitter#pattern-matching

query = PY_LANGUAGE.query(
    """
(function_definition
  name: (identifier) @function.def)

(call
  function: (identifier) @function.call)
"""
)

captures = query.captures(r)
print(captures)

## 5.2. Data Flow Graph (DFG)

A *data flow graph* is a graph that represents the data dependencies between a number of operations.

#### Example

Finding the roots of a quadratic equation (http://bears.ece.ucsb.edu/research-info/DP/dfg.html):
```
quad(a, b, c)
t1 = a*c;
t2 = 4*t1;
t3 = b*b;
t4 = t3 - t2;
t5 = sqrt(t4);
t6 = -b;
t7 = t6 - t5;
t8 = t6 + t5;
t9 = 2*a;
r1 = t7/t9;
r2 = t8/t9;
```

DFG:

![DFG](http://bears.ece.ucsb.edu/research-info/DP/dfg.figure.id.1.gif)
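
For straight-line code, the edges of such a graph can be recovered directly from the assignments. The sketch below uses a Python transcription of the quadratic-roots computation, with `t8 = t6 + t5` for the second root and `** 0.5` in place of `sqrt`. It assumes every statement is a simple assignment to a single name; the function name `data_flow_edges` is hypothetical.

```python
import ast

code = """
t1 = a * c
t2 = 4 * t1
t3 = b * b
t4 = t3 - t2
t5 = t4 ** 0.5
t6 = -b
t7 = t6 - t5
t8 = t6 + t5
t9 = 2 * a
r1 = t7 / t9
r2 = t8 / t9
"""

def data_flow_edges(source):
    """Return (producer, consumer) pairs: the consumer's value is computed
    from the producer's value."""
    edges = []
    for stmt in ast.parse(source).body:
        target = stmt.targets[0].id
        used = {n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)}
        edges.extend((name, target) for name in sorted(used))
    return edges

edges = data_flow_edges(code)
print(sorted(src for src, dst in edges if dst == "r2"))
```

The inputs `a`, `b`, and `c` appear only as producers, and the results `r1` and `r2` only as consumers.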

## 5.3. Control Flow Graph (CFG)

A *control flow graph* is a graph representation of the control flow of a program: its nodes are basic blocks of code, and its edges are the possible transfers of control between them during execution.

#### Example

https://en.wikipedia.org/wiki/Control-flow_graph

![](./res/01_cfg_ab.png)

- (a) an if-then-else
- (b) a while loop

![](./res/01_cfg_cd.png)

- (c) a natural loop with two exits, e.g. while with an if...break in the middle
- (d) a loop with two entry points, e.g. goto into a while or for loop
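
Shapes (a) and (b) can be written down as adjacency lists. This is a hand-built sketch: the node names (`cond`, `then`, `header`, and so on) are labels chosen here, not output of any tool. A depth-first search then distinguishes the loop from the branch by looking for a back edge.

```python
# Each CFG maps a basic block to the blocks control may reach next.
IF_THEN_ELSE = {"cond": ["then", "else"], "then": ["join"], "else": ["join"], "join": []}
WHILE_LOOP = {"header": ["body", "exit"], "body": ["header"], "exit": []}

def has_cycle(cfg, entry):
    """Depth-first search that reports whether a back edge (a loop) exists."""
    on_stack, done = set(), set()

    def visit(node):
        on_stack.add(node)
        for succ in cfg[node]:
            if succ in on_stack:
                return True  # edge back to a block that is still being explored
            if succ not in done and visit(succ):
                return True
        on_stack.discard(node)
        done.add(node)
        return False

    return visit(entry)

print(has_cycle(IF_THEN_ELSE, "cond"), has_cycle(WHILE_LOOP, "header"))
```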

#### python-graphs

- https://github.com/google-research/python-graphs
- https://pypi.org/project/python-graphs/
- https://github.com/google-research/python-graphs/blob/main/examples/control_flow_example.py

In [None]:
def bubbleSort(arr):
    n = len(arr)
    swapped = False
    for i in range(n-1):
        for j in range(0, n-i-1):
            if arr[j] > arr[j + 1]:
                swapped = True
                arr[j], arr[j + 1] = arr[j + 1], arr[j]        
        if not swapped:
            return

fn = bubbleSort

In [None]:
from python_graphs import control_flow

graph = control_flow.get_control_flow_graph(fn)

In [None]:
dir(graph)

In [None]:
from python_graphs import control_flow_graphviz
from python_graphs import program_utils

source = program_utils.getsource(fn)
control_flow_graphviz.render(graph, include_src=source, path='cfg_bubble_sort.png')

![](./res/01_cfg_bubble_sort.png)

#### Program graph

- https://github.com/google-research/python-graphs/blob/main/examples/program_graph_example.py

```
from python_graphs import program_graph
from python_graphs import program_graph_graphviz

graph = program_graph.get_program_graph(fn)
program_graph_graphviz.render(graph, path='pg.png')
```

# 6. Code metrics

- *You can't control what you can't measure.*
- *Good metrics don't mean good quality.*

#### [SLOC (LOC)](https://en.wikipedia.org/wiki/Source_lines_of_code)

Source lines of code or lines of code.

*Physical SLOC (LOC)* --- a count of lines in the text of the program's source code, excluding comment lines.

*Logical SLOC (LLOC)* --- the number of executable "statements".

```
for (i = 0; i < 100; i++) printf("hello"); /* How many lines of code is this? */
```

- 1 physical line of code (LOC)
- 2 logical lines of code (LLOC) (for statement and printf statement)
- 1 comment line

```
/* Now how many lines of code is this? */
for (i = 0; i < 100; i++)
{
    printf("hello");
}
```

- 4 physical lines of code (LOC): should the effort of placing braces be counted?
- 2 logical lines of code (LLOC): what about the work of writing non-statement lines?
- 1 comment line: tools must account for all code and comments regardless of comment placement
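
The same counts can be sketched for Python code. This is a naive counter, assuming comments occupy whole lines and that no `#` line occurs inside a multi-line string. Logical lines are approximated by the number of statement nodes in the AST.

```python
import ast

code = '''\
# How many lines of code is this?
for i in range(100): print("hello")
'''

# Physical lines: non-blank lines that are not comment-only lines.
lines = [line.strip() for line in code.splitlines() if line.strip()]
comment_lines = sum(line.startswith("#") for line in lines)
physical_loc = len(lines) - comment_lines

# Logical lines: every statement node counts once (here: `for` and the call).
logical_loc = sum(isinstance(node, ast.stmt) for node in ast.walk(ast.parse(code)))

print(physical_loc, logical_loc, comment_lines)
```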

#### Cyclomatic complexity

- [Cyclomatic complexity](https://en.wikipedia.org/wiki/Cyclomatic_complexity)
- https://github.com/google-research/python-graphs/blob/main/examples/cyclomatic_complexity_example.py

*Cyclomatic complexity* is a software metric used to indicate the complexity of a program. It is a quantitative measure of the number of linearly independent paths through a program's source code.

For a control flow graph, it can be computed as

$$M = E - N + 2P,$$

where
- E = the number of edges of the CFG.
- N = the number of nodes of the CFG.
- P = the number of connected components.
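
The formula can be checked by hand on small graphs. The sketch below assumes a CFG given as an adjacency list with a single connected component (P = 1). The graphs and the function name `complexity_from_cfg` are made up for this example and are independent of the `python_graphs` library.

```python
def complexity_from_cfg(cfg, components=1):
    """Compute E - N + 2P for a CFG given as {node: [successors]}."""
    nodes = len(cfg)
    edges = sum(len(successors) for successors in cfg.values())
    return edges - nodes + 2 * components

STRAIGHT_LINE = {"entry": ["exit"], "exit": []}
IF_THEN_ELSE = {"cond": ["then", "else"], "then": ["join"], "else": ["join"], "join": []}
WHILE_LOOP = {"header": ["body", "exit"], "body": ["header"], "exit": []}

for name, cfg in [("straight line", STRAIGHT_LINE),
                  ("if-then-else", IF_THEN_ELSE),
                  ("while loop", WHILE_LOOP)]:
    print(name, complexity_from_cfg(cfg))
```

Code without branches has complexity 1, and a single decision point (a branch or a loop condition) raises it to 2.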

In [None]:
from python_graphs import control_flow
from python_graphs import cyclomatic_complexity

graph = control_flow.get_control_flow_graph(fn)
value = cyclomatic_complexity.cyclomatic_complexity(graph)
print(value)

# 7. Code analysis

#### Static and dynamic code analysis

*Static analysis* is the analysis of computer programs performed without executing them.

*Dynamic analysis* is the analysis of computer software that involves executing the program in question.
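
A small sketch of the difference, using a made-up function `ratio` that divides by a literal zero. The static check inspects the syntax tree without running anything. The dynamic check executes the function on one concrete input and observes the failure.

```python
import ast

code = "def ratio(x):\n    return x / 0\n"

# Static analysis: look for a division whose right operand is the literal 0.
static_findings = [
    node.lineno
    for node in ast.walk(ast.parse(code))
    if isinstance(node, ast.BinOp)
    and isinstance(node.op, ast.Div)
    and isinstance(node.right, ast.Constant)
    and node.right.value == 0
]

# Dynamic analysis: run the program on a concrete input and observe it.
namespace = {}
exec(code, namespace)
try:
    namespace["ratio"](1)
    dynamic_finding = None
except ZeroDivisionError as error:
    dynamic_finding = str(error)

print(static_findings, dynamic_finding)
```

The static check reports the suspicious line for all inputs at once but only recognizes this one syntactic pattern. The dynamic check sees the real failure but only for the input that was actually tried.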


#### The Halting Problem

> Given a program $P$ and an input $I$, will $P$ halt on $I$?

Assume that there exists an algorithm $H(P,I)$, which solves the halting problem for every program $P$ and input $I$.

Let $X(P)$ be the following algorithm:
1. make two copies of $P$
2. run $H(P,P)$
3. if $H(P,P)$ outputs True, go into an infinite loop and run forever
4. if $H(P,P)$ outputs False, terminate

Now compare the behavior of $X(X)$ with the answer of $H(X, X)$:
- If $H(X, X)$ returns True (claiming that $X$ halts on $X$), then $X(X)$ goes into an infinite loop and does not halt. Contradiction.
- If $H(X, X)$ returns False (claiming that $X$ does not halt on $X$), then $X(X)$ terminates. Contradiction.

Thus, $H$ doesn't exist.
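
The construction of $X$ can be sketched in Python for any claimed decider. Here `h_candidate` is a hypothetical decider that always answers "does not halt", chosen so the example can actually be run. A candidate that answers True on $(X, X)$ would be refuted in the same way, but running `x(x)` would then loop forever.

```python
def make_x(h):
    """Build X from a claimed halting decider h(program, input) -> bool."""
    def x(program):
        if h(program, program):
            while True:  # h says "halts", so X deliberately loops forever
                pass
        return "halted"  # h says "does not halt", so X terminates
    return x

def h_candidate(program, program_input):
    """Hypothetical decider that always answers "does not halt"."""
    return False

x = make_x(h_candidate)
prediction = h_candidate(x, x)  # False: claims that X(X) never halts
outcome = x(x)                  # yet X(X) returns immediately
print(prediction, outcome)
```

Whatever `h` answers about `(x, x)`, `x` does the opposite, so no implementation of `h` can be right on every input.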

Rice's theorem:
> Any nontrivial property about the language recognized by a Turing machine is undecidable.

# 8. Exercise

Given a Python code fragment $S$,
create a code fragment $S'$ that differs from $S$ only in the following:
- all comments are removed
- tabs are replaced with spaces, and insignificant whitespace is removed
- the names of variables, functions, and classes are replaced with new ones: "name_<number>".

# 9. References

- [IEEE24765-2017 - Systems and software engineering Vocabulary](https://www.cse.msu.edu/~cse435/Handouts/Standards/IEEE24765.pdf)
- [Aho et al - Compilers: Principles, Techniques, and Tools](https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools)
- [Abstract syntax tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree)
- [Tree-sitter basics](https://eightify.app/summary/climate-change/tree-sitter-guide-mastering-the-basics)
- [Tree-sitter: traverse a large number of nodes](https://github.com/tree-sitter/py-tree-sitter?tab=readme-ov-file#walking-syntax-trees)
- [Tree-sitter: edit the syntax tree](https://github.com/tree-sitter/py-tree-sitter?tab=readme-ov-file#editing)
- [Data flow graphs](http://bears.ece.ucsb.edu/research-info/DP/dfg.html)
- [Data flow graphs (video)](https://www.youtube.com/watch?v=PmAK_5MrUMg)
- [CodeQL: Analyzing data flow in Python](https://codeql.github.com/docs/codeql-language-guides/analyzing-data-flow-in-python/)
- [Program dependence graph](https://piazza.com/class_profile/get_resource/hy7enxf648g7me/i2qodoub2q73x)
- [Code property graph](https://en.wikipedia.org/wiki/Code_property_graph)
- [Halting problem](https://cs.uwaterloo.ca/~a23gao/cs245_f17/notes/undecidability_solutions.pdf)
- [Rice's theorem](http://kilby.stanford.edu/~rvg/154/handouts/Rice.html)