Line number storage scheme is inefficient #648
In the following dump:
Here is a histogram showing how often you have to skip n bytes and n lines. The data was collected from the CPython 3.4 stdlib (111,000 lines of Python code) and gives 125,000 data points, accumulated into a histogram. Note the log-log scale.
Read it as follows: more than 50% of the time, you just need to skip 1 line (the blue curve is highest at 1). The most common number of bytes to skip is 5 (the red curve is highest at 5). 0.1% of the time you need to skip more than 30 lines.
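For anyone wanting to gather similar statistics, here is a rough sketch. It is not the script used for the histogram above: it assumes CPython's own bytecode and its `dis.findlinestarts` helper rather than the compiler discussed in this issue, so the numbers it produces will differ. The name `skip_histograms` is made up for illustration.

```python
import dis
from collections import Counter


def skip_histograms(source, filename="<example>"):
    """Count byte skips and line skips between consecutive line starts.

    Assumption: uses CPython bytecode offsets, which are not the same as
    the bytecode offsets of the implementation discussed in this issue.
    """
    byte_skips, line_skips = Counter(), Counter()

    def visit(code):
        prev_offset, prev_line = 0, code.co_firstlineno
        for offset, line in dis.findlinestarts(code):
            if line is None:  # some CPython versions report no line here
                continue
            byte_skips[offset - prev_offset] += 1
            line_skips[line - prev_line] += 1
            prev_offset, prev_line = offset, line
        # Recurse into nested code objects (functions, classes, lambdas).
        for const in code.co_consts:
            if hasattr(const, "co_code"):
                visit(const)

    visit(compile(source, filename, "exec"))
    return byte_skips, line_skips


sample = "x = 1\n\n\ndef f(a):\n    b = a + 1\n    return b\n"
byte_hist, line_hist = skip_histograms(sample)
print(sorted(line_hist.items()))
```

Running this over every `.py` file in a source tree and summing the counters would give the raw data for a plot like the one described.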
Now much more efficient.
Based on the above statistics (using the CPython 3.4 stdlib), I got it close to optimal for this suite of scripts. The original method used 13 bytes per (bytecode, line number) pair; the new method uses just 1.3 on average. The method is tuned for this set of scripts, but they should be indicative of general Python scripts.
In the new method, there are 2 encoding schemes: a 1-byte one and a 2-byte one. The 1-byte one is used for 75% of the (bytecode, line number) pairs.
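To make the two schemes concrete, here is a minimal sketch of one way such an encoding could look. The bit layout is an assumption (this comment does not spell it out): the 1-byte form is taken to be `0b0LLBBBBB` (up to 31 bytes, up to 3 lines) and the 2-byte form `0b1LLLBBBB 0bLLLLLLLL` (up to 15 bytes, up to 2047 lines). The names `encode_pair` and `decode` are hypothetical.

```python
def encode_pair(byte_skip, line_skip):
    """Encode one (byte skip, line skip) pair as one or more entries.

    Assumed layout, not confirmed by the text above:
      1 byte:  0b0LLBBBBB             (bytes <= 31, lines <= 3)
      2 bytes: 0b1LLLBBBB 0bLLLLLLLL  (bytes <= 15, lines <= 2047)
    Skips too large for one entry are split across several entries.
    """
    if byte_skip < 0 or line_skip < 0:
        raise ValueError("skips must be non-negative")
    out = bytearray()
    while byte_skip > 0 or line_skip > 0:
        if line_skip <= 3 or byte_skip > 0xF:
            # 1-byte form; lines only advance once all bytes are consumed.
            b = min(byte_skip, 0x1F)
            l = min(line_skip, 3) if b == byte_skip else 0
            out.append((l << 5) | b)
        else:
            # 2-byte form; top 3 line bits go in the first byte.
            b = byte_skip  # known to be <= 0xF in this branch
            l = min(line_skip, 0x7FF)
            out.append(0x80 | ((l >> 4) & 0x70) | b)
            out.append(l & 0xFF)
        byte_skip -= b
        line_skip -= l
    return bytes(out)


def decode(data):
    """Turn encoded bytes back into a list of (byte skip, line skip) entries."""
    entries, i = [], 0
    while i < len(data):
        first = data[i]
        if first & 0x80 == 0:
            entries.append((first & 0x1F, first >> 5))
            i += 1
        else:
            lines = ((first & 0x70) << 4) | data[i + 1]
            entries.append((first & 0x0F, lines))
            i += 2
    return entries


print(encode_pair(5, 1).hex())     # common case: fits in one byte
print(decode(encode_pair(4, 30)))  # larger line skip: two bytes
```

As a sanity check on the figures: if exactly 75% of pairs take 1 byte and the rest take 2, the mean is 0.75 × 1 + 0.25 × 2 = 1.25 bytes per pair, which is in line with the roughly 1.3 average quoted above once occasional multi-entry skips are counted.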
Note that there are a fair number of cases (around 10%) where large line skips occur within a function (not just the first line of the function being a large number). Thus, it's important that the coding scheme works within functions, not just that it optimises for the first line being large. Also, the code is simpler without a special case for the first line of a function.