In [4]:
%load_ext autoreload
%autoreload 2


In Pharo-ArchC and related fundamental parts of Smalltalk-25,
we call things of the form (using a PPC example here)
```
    addis RT, RA, D
```
_instruction declarations_, and things of the form
```
    addis r3, r1, 0x1234
```
_ground instruction instances_.
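
To make the distinction concrete, here is a minimal sketch of how a ground instance arises from a declaration by binding its operands. It assumes the standard PPC D-form layout for `addis` (primary opcode 15, then `RT`, `RA`, and the 16-bit `D` field); the helper name `encode_addis` is hypothetical and is not part of Pharo-ArchC.

```python
def encode_addis(rt: int, ra: int, d: int) -> int:
    """Bind RT, RA and D of the declaration `addis RT, RA, D` to concrete values.

    Hypothetical helper, assuming the PPC D-form layout:
    opcode (6 bits) | RT (5 bits) | RA (5 bits) | D (16 bits).
    """
    if not (0 <= rt < 32 and 0 <= ra < 32):
        raise ValueError("RT and RA are 5-bit register numbers")
    if not (-0x8000 <= d <= 0xFFFF):
        raise ValueError("D must fit in 16 bits")
    return (15 << 26) | (rt << 21) | (ra << 16) | (d & 0xFFFF)


# The ground instance `addis r3, r1, 0x1234` as a 32-bit encoding.
ground = encode_addis(rt=3, ra=1, d=0x1234)
print(f"addis r3, r1, 0x1234 -> {ground:#010x}")  # 0x3c611234
```

In this picture the declaration corresponds to the function and each ground instance to one of its values.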

We say that two VEX IRSBs _have the same shape_ if they differ only
in their leaf constants.  Leaf constants include the `U16`/`U32`/etc. constants in `Const`
expressions, and also things like register offsets in `GET` and `PUT`
(because, say, when _RA_ varies, those offsets vary too).
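
The definition above can be sketched as a normalization that replaces every leaf constant with `*`. The statement format below is a simplified, hypothetical textual rendering chosen for illustration (it is not pyvex's actual pretty-printer output), and the register offsets in the examples are made up.

```python
import re

# Leaf positions in the simplified textual format assumed here:
# GET offsets, PUT offsets, and the value inside a Const expression.
_NUMBER = r"(?:0x[0-9a-fA-F]+|\d+)"
_LEAF_PATTERNS = [
    (re.compile(r"GET:(\w+) " + _NUMBER), r"GET:\1 *"),
    (re.compile(r"PUT\(" + _NUMBER + r"\)"), "PUT(*)"),
    (re.compile(r"Const (\w+) " + _NUMBER), r"Const \1 *"),
]


def shape_of(statements):
    """Return the shape of an IRSB given as a list of statement strings."""
    shaped = []
    for stmt in statements:
        for pattern, replacement in _LEAF_PATTERNS:
            stmt = pattern.sub(replacement, stmt)
        shaped.append(stmt)
    return tuple(shaped)


# Two instances with RA != 0 (offsets are illustrative, not real guest-state offsets).
irsb_a = ["t0 = GET:I32 24", "t2 = Add32(t0,Const U32 0x30000)", "PUT(16) = t2"]
irsb_b = ["t0 = GET:I32 44", "t2 = Add32(t0,Const U32 0x90000)", "PUT(36) = t2"]
# An instance with RA == 0: no Add32 at all.
irsb_c = ["t2 = Const U32 0x0", "PUT(16) = t2"]

print(shape_of(irsb_a) == shape_of(irsb_b))  # True
print(shape_of(irsb_a) == shape_of(irsb_c))  # False
```

Returning a tuple keeps shapes hashable, so they can serve as dictionary keys when grouping instances.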

For example, `addis r0, r2, 3` and `lis r0, 0` have different shapes, since `lis` ignores the `RA` register:

In [42]:
from bitstring import Bits
from regularization.isa import powerpc, arm, Insn

addis_1 = Insn(powerpc.addis, [ Bits('0x3c020003') ])
addis_2 = Insn(powerpc.addis, [ Bits('0x3c000000') ])

def print_diff(obj1, text1, obj2, text2):
    """Display an HTML side-by-side diff of text1 and text2, captioned obj1 and obj2."""
    from difflib import HtmlDiff
    from IPython.display import display, HTML

    # Left-align the cells of the diff table.
    stylHTML = '<style>table.diff td { text-align: left }</style>'
    diffHTML = HtmlDiff(wrapcolumn=80).make_file(
        str(text1).splitlines(keepends=True),
        str(text2).splitlines(keepends=True),
        obj1, obj2)
    display(HTML(stylHTML + diffHTML))


print_diff(addis_1.disassembled, addis_1.VEXsig,addis_2.disassembled, addis_2.VEXsig)

| # | addis r0, r2, 3 | # | lis r0, 0 |
|---|---|---|---|
| 1 | `t0 = GET:I32 *` | 1 | `t0 = GET:I32 *` |
| 2 | `t1 = GET:I32 *` | 2 | `t1 = GET:I32 *` |
| 3 | `t2 = Add32(t0,Const U32 *)` | 3 | `t2 = Const U32 *` |
| 4 | `PUT(*) = t2` | 4 | `PUT(*) = t2` |
| 5 | `PUT(*) = Const U32 *` | 5 | `PUT(*) = Const U32 *` |
| 6 | `t3 = GET:I32 *` | 6 | `t3 = GET:I32 *` |


Treating register offsets as leaf constants has the disadvantage that special offsets, like PC=1168 on PPC, are not recognized
as special; cf. the criticism of ARM uniform SPRs in Waterman's thesis.

Of course, two IRSBs of different shapes can still denote the same
function; in this sense shape is not a hash for homotopy.

An instruction is called _vex-regular_ if all its ground instances
lift to VEX of the same shape.  For example, `bla` on PPC is regular.
However, `addis` is irregular, because in the special case of _RA_=0
VEX short-circuits the `Add32` binop.

Equality of VEX shapes therefore partitions the total space of instances into disjoint shape classes.  The
class `VEXShapeAnalysis` computes a section of the total instance-encoding space: out of each shape class, it picks one representative.  It returns the list of these shapes along with their representatives:

In [31]:
from regularization.vexshape import VEXShapeAnalysis
analysis = VEXShapeAnalysis(powerpc.addis)

In [43]:
analysis.run(100000)

print(f"Found {len(analysis.shapes)} shapes so far:")
for shape in analysis.shapes:
    print(shape.example)

Found 2 shapes so far:
Insn(Bits('0x3c010000')) # addis r0, r1, 0
Insn(Bits('0x3c000000')) # lis r0, 0
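
The idea of picking one representative per shape class can be sketched as follows. This is not how `VEXShapeAnalysis` is implemented: the function names are hypothetical, the instances are enumerated rather than sampled, and `toy_addis_shape` is a stand-in for actually lifting to VEX that only models the _RA_=0 short-circuit described above.

```python
def shape_classes(encodings, shape_of_encoding):
    """Map each observed shape to the first encoding that produced it."""
    representatives = {}
    for encoding in encodings:
        representatives.setdefault(shape_of_encoding(encoding), encoding)
    return representatives


def toy_addis_shape(encoding):
    """Stand-in for lifting: only distinguishes RA == 0 from RA != 0."""
    ra = (encoding >> 16) & 0x1F
    return "t2 = Const U32 *" if ra == 0 else "t2 = Add32(t0,Const U32 *)"


def some_addis_encodings():
    """Enumerate a few `addis r0, RA, D` instances (opcode 15, RT = 0)."""
    for ra in range(32):
        for d in (0, 3):
            yield (15 << 26) | (ra << 16) | d


classes = shape_classes(some_addis_encodings(), toy_addis_shape)
print(f"Found {len(classes)} shapes:")
for shape, example in classes.items():
    print(f"{example:#010x}  {shape}")

# In these terms, an instruction is vex-regular when exactly one class is found.
is_regular = len(classes) == 1
```

With random sampling instead of enumeration, a rare shape class can be missed, which is presumably why the output above says "so far".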


Note how different ISAs differ in terminology regarding what is an
instruction, a page, or an extended mnemonic, and how ArchC reflects
these differences.  Take the branch instruction as an example.  The PPC
"Branch I-Form" instructions (`b`, `ba`, `bl`, `bla`) form a single
`# Branch` page but are considered separate instructions: the `LK` and
`AA` bits are part of the decoder, which is especially evident in the
ArchC model.  Contrast this with the `H` bit in the ARM `b` instruction:
`b` and `bl` are considered extended mnemonics of the same `b` instruction.
One can think of editing the ISA to split `b` and `bl` into separate
instructions.  If one continues this splitting far enough, one arrives at an ISA
formulation where all instructions are vex-regular.  We call this process
_vex-regularization_.  Decoder functions in this regularized
ISA will not be nicely aligned along bit boundaries: instruction
decode will include some _guard predicates_, e.g. PPC `addis` above will
have the guards _RA_==0 and _RA_!=0.
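
A decoder with guard predicates might look like the sketch below. It assumes the standard PPC D-form field layout; the variant names `addis[RA==0]` and `addis[RA!=0]` and both function names are hypothetical, chosen only to illustrate the idea.

```python
def decode_addis_fields(word):
    """Extract (RT, RA, D) from a 32-bit PPC D-form word with primary opcode 15."""
    if (word >> 26) & 0x3F != 15:
        raise ValueError("not an addis encoding")
    rt = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    d = word & 0xFFFF
    if d >= 0x8000:  # D is a signed 16-bit immediate
        d -= 0x10000
    return rt, ra, d


# Each regularized variant pairs a (hypothetical) name with a guard over the fields.
REGULARIZED_ADDIS = [
    ("addis[RA==0]", lambda rt, ra, d: ra == 0),
    ("addis[RA!=0]", lambda rt, ra, d: ra != 0),
]


def regularized_decode(word):
    """Return the name of the regularized variant whose guard accepts the word."""
    rt, ra, d = decode_addis_fields(word)
    for name, guard in REGULARIZED_ADDIS:
        if guard(rt, ra, d):
            return name
    raise ValueError("no guard matched")


print(regularized_decode(0x3C020003))  # addis[RA!=0]
print(regularized_decode(0x3C000000))  # addis[RA==0]
```

The guard on _RA_ cannot be expressed as a fixed bit pattern of the opcode field alone, which is what "not nicely aligned along bit boundaries" refers to.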