| theme | size | inlineSVG | footer |
|---|---|---|---|
| gaia | 58140 | true | Gabriele Modena, 2020-09-23 |
git clone https://github.com/gmodena/pycdump/
- Assume Python = CPython
- Assume running on x86_64
- How does CPython execute (byte) code?
- What do pyc files look like?
- What tools are available to analyse bytecode?
- Get a better understanding of technology I use every day
- Documentation is sometimes lacking / often out of date
- Python introspection has a lot of capabilities to offer
- There are some practical use cases / considerations
<style scoped> img { display: block; margin-left: auto; margin-right: auto; } </style>
def sum(a, b):
    return a + b
type(sum)
sum.__code__ # a code object to be executed
[co for co in sum.__code__.co_code] # raw compiled bytecode, one integer per byte
import dis
dis.dis(sum)
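The same disassembly is also available programmatically. A minimal sketch, assuming a function named `add` (our name, chosen to avoid shadowing the built-in `sum`); the exact opcode names vary between CPython versions:

```python
import dis


def add(a, b):
    return a + b


# Each Instruction carries the byte offset, the opcode name and its decoded argument.
instructions = list(dis.get_instructions(add))
for ins in instructions:
    print(ins.offset, ins.opname, ins.argrepr)

opnames = [ins.opname for ins in instructions]
```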
<style scoped> img { display: block; margin-left: auto; margin-right: auto; } </style>
Code objects provide these attributes (and several more):
- co_code: raw compiled bytecode (a bytes object in Python 3)
- co_filename: name of the file in which this code object was created
- co_varnames: tuple of names of arguments and local variables
- co_consts: tuple of constants used in the bytecode; it can contain nested code objects
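A short sketch of these attributes in practice. The functions `outer` and `inner` are hypothetical names used only for illustration:

```python
from types import CodeType


def outer(x):
    def inner(y):
        return y + 1
    return inner(x)


code = outer.__code__
print(type(code.co_code))  # the raw bytecode is a bytes object
print(code.co_filename)    # file in which the code object was created
print(code.co_varnames)    # arguments first, then local variables

# A nested function is stored as a code object among the constants of its parent.
nested = [c for c in code.co_consts if isinstance(c, CodeType)]
print([c.co_name for c in nested])
```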
$ cat example.py
a, b = 1, 0
if a or b:
    print("Hello", a)
Given its bytecode representation, can we regenerate the source?
$ python -m compileall example.py
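The compilation step can also be driven from Python. A minimal sketch using `py_compile` on a temporary copy of the example (the temporary directory is our assumption, used to keep the sketch self-contained):

```python
import importlib.util
import os
import py_compile
import tempfile

source = 'a, b = 1, 0\nif a or b:\n    print("Hello", a)\n'

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "example.py")
    with open(src_path, "w") as f:
        f.write(source)

    # py_compile.compile returns the path of the pyc file it wrote,
    # by default under __pycache__ next to the source.
    pyc_path = py_compile.compile(src_path, doraise=True)
    print(pyc_path)

    # The first four bytes are the magic number of the running interpreter.
    with open(pyc_path, "rb") as f:
        magic = f.read(4)
    matches_interpreter = magic == importlib.util.MAGIC_NUMBER
```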
A pyc looks like... bytes
$ cat __pycache__/example.cpython-37.pyc
B
?S"^=?@s*d\ZZeserede?iZded<dS))??ZHellorrN)?a?b?print?c?rr?
example.py<module>s
<style scoped> pre { font-size: 48%; } </style>
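import binascii
import dis
import marshal
import struct
import time

FIELD_SIZE = 4  # each header field is 32 bits: 32 // 8 bytes

def main(fname):
    with open(fname, "rb") as infile:
        # Header: bytes 0 - 3, magic number identifying the bytecode version
        magic_number = binascii.hexlify(infile.read(FIELD_SIZE))
        # Header: bytes 4 - 7, bit field (Python 3.7+)
        bit_field = infile.read(FIELD_SIZE)
        # Header: bytes 8 - 11, source modification time (timestamp-based pyc)
        moddate = infile.read(FIELD_SIZE)
        # Header: bytes 12 - 15, source size (timestamp-based pyc)
        source_size = infile.read(FIELD_SIZE)
        # Header fields are stored little-endian
        modtime = time.asctime(time.localtime(struct.unpack("<L", moddate)[0]))
        source_size = struct.unpack("<L", source_size)[0]
        # Payload: bytes 16 - ..., the marshalled module code object
        code_obj = marshal.load(infile)
    frames = dump(code_obj)
    for tpl in frames:
        dis.disassemble(tpl[2])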
<style scoped> pre { font-size: 75%; } </style>
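from types import CodeType

def dump(code_obj):
    frames = []

    def ddump(code_obj):
        # Recurse into nested code objects (functions, classes, comprehensions)
        for const in code_obj.co_consts:
            if isinstance(const, CodeType):
                ddump(const)
        frames.append((code_obj.co_filename, code_obj.co_firstlineno, code_obj))

    ddump(code_obj)
    # Order the frames by the line on which each code object starts
    frames.sort(key=lambda tpl: tpl[1])
    return frames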
$ python dump.py __pycache__/example.cpython-37.pyc
From https://coverage.readthedocs.io/en/coverage-5.3/howitworks.html
After your program has been executed and the line numbers recorded, coverage.py needs to determine what lines could have been executed. Luckily, compiled Python files (.pyc files) have a table of line numbers in them. Coverage.py reads this table to get the set of executable lines
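The line table can be read with `dis.findlinestarts`. This is a rough sketch of the idea, not coverage.py's actual implementation; `executable_lines` is a hypothetical helper, and recent CPython versions may also report a bookkeeping line 0 for module code:

```python
import dis
from types import CodeType

source = 'a, b = 1, 0\nif a or b:\n    print("Hello", a)\n'
code = compile(source, "example.py", "exec")


def executable_lines(code_obj):
    """Collect the line numbers recorded in a code object and its nested code objects."""
    lines = {line for _, line in dis.findlinestarts(code_obj) if line is not None}
    for const in code_obj.co_consts:
        if isinstance(const, CodeType):
            lines |= executable_lines(const)
    return lines


lines = executable_lines(code)
print(sorted(lines))
```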
- Reversing a Simple Python Ransomware https://or10nlabs.tech/reversing-a-simple-python-ransomware/
- Mask payload binary patches by tampering with the header
https://www.python.org/dev/peps/pep-0552/
- Reproducible builds (same input code generates the same pyc files)
- Timestamp header field makes pyc non-deterministic
- The PEP proposes allowing the timestamp to be replaced with a deterministic hash
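A minimal sketch of a hash-based pyc, assuming Python 3.7 or later. It compiles a temporary one-line module (our own stand-in) with `PycInvalidationMode.CHECKED_HASH` and decodes the bit field from the header:

```python
import os
import py_compile
import struct
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "example.py")
    with open(src_path, "w") as f:
        f.write("a, b = 1, 0\n")

    pyc_path = py_compile.compile(
        src_path,
        cfile=os.path.join(tmp, "example.pyc"),
        doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
    )

    with open(pyc_path, "rb") as f:
        header = f.read(16)

# Bytes 4 - 7 hold the bit field, stored little-endian.
(flags,) = struct.unpack("<L", header[4:8])
hash_based = bool(flags & 0b01)    # bit 0: the pyc is hash-based
check_source = bool(flags & 0b10)  # bit 1: validate the hash against the source on import
source_hash = header[8:16]         # an 8-byte source hash replaces mtime and size
print(flags, hash_based, check_source, source_hash.hex())
```

With a hash in place of the timestamp, compiling the same source twice yields the same header.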
- https://github.com/gmodena/pycdump/
- https://nedbatchelder.com/blog/200804/the_structure_of_pyc_files.html
- https://docs.python.org/3/c-api/code.html
<style scoped> img { display: block; margin-left: auto; margin-right: auto; } </style>