# Statistical Breakdown

I will parse my [sample](./samples/) documents as NumPy arrays to determine the parents of unknown blocks inside the [blockset](./blockset/).

For each document, I am populating a [`dataclass`](./searches/character_types.py) rule with the following:

* `start`: Lowest value the rule matches (inclusive)
* `end`: Highest value the rule matches (inclusive)
* `match_percent`: Percentage of non-empty values in a block that must match.

For example:

* `start`: 65
* `end`: 122
* `match_percent`: 70
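
As a rough sketch, the example values could be held in a dataclass like the one below. The class name `ByteRangeRule` and the validation in `__post_init__` are assumptions made for illustration; the actual definition lives in `./searches/character_types.py` and may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByteRangeRule:
    """Hypothetical rule shape; the real dataclass may differ."""

    start: int          # lowest value that counts as a match (inclusive)
    end: int            # highest value that counts as a match (inclusive)
    match_percent: int  # percentage of non-empty values that must match

    def __post_init__(self) -> None:
        # Values come from np.uint8 data, so they must fit in 0..255.
        if not 0 <= self.start <= self.end <= 255:
            raise ValueError("expected 0 <= start <= end <= 255")
        if not 0 <= self.match_percent <= 100:
            raise ValueError("match_percent must be between 0 and 100")


rule = ByteRangeRule(start=65, end=122, match_percent=70)
print(rule)
```

Freezing the dataclass is a design choice in this sketch, not something the source states: it keeps a rule from being changed after it is built.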

This rule compares every non-empty value in a block's `np.array` (anything other than `255` in the `np.uint8` encoding) and matches values greater than or equal to 65 and less than or equal to 122. That range covers the uppercase Latin letters (65-90) and the lowercase Latin letters (97-122), plus the six punctuation characters between them (91-96). If more than 70% of the non-empty values match, the block is assigned to the relevant parent.
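
The check described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's implementation: the function name `block_matches`, the `EMPTY` constant, and the choice to return `False` for a block with no non-empty values are all assumptions.

```python
import numpy as np

EMPTY = 255  # assumed marker for an empty value in the np.uint8 encoding


def block_matches(block: np.ndarray, start: int, end: int, match_percent: float) -> bool:
    """Return True if more than match_percent of non-empty values lie in [start, end]."""
    non_empty = block[block != EMPTY]
    if non_empty.size == 0:
        return False  # assumption: a block with nothing to compare never matches
    in_range = (non_empty >= start) & (non_empty <= end)
    # The mean of a boolean mask is the fraction of True entries.
    return bool(in_range.mean() * 100 > match_percent)


# 12 text bytes followed by 4 empty bytes; 10 of the 12 are Latin letters.
block = np.frombuffer(b"Hello, world" + b"\xff" * 4, dtype=np.uint8)
print(block_matches(block, start=65, end=122, match_percent=70))  # True (10/12 is about 83%)
```

The comma and the space fall outside 65-122, so the example block matches at roughly 83%, which clears the 70% threshold.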

In [51]:
import numpy as np
import pandas as pd

sample_files = ["samples/sample.doc", "samples/sample.pdf", "samples/sample.jpeg"]

sample_data = []
for sample in sample_files:
    # Read each file as raw bytes; np.fromfile accepts a path directly
    sample_data.append((sample, np.fromfile(sample, dtype=np.uint8)))

for name, data in sample_data:
    print(f"\n---\n\t{name}\n---")

    # Drop the lowest and highest values (0-1 and 253-255), treated as empty or extraneous
    clean = data[(data > 1) & (data < 253)]

    # Set a threshold: only keep values that make up at least 0.5% of the cleaned data
    threshold = len(clean) * 0.005
    print(f"Total values:\t{len(clean)}\nThreshold:\t{threshold}")

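    # np.unique returns (values, counts); rot90 turns that 2 x n result into
    # n rows of (value, count) in reverse order, so the index labels descend
    out = pd.DataFrame(
        np.rot90(np.unique(clean, return_counts=True)), columns=["value", "count"]
    )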

    pass_threshold = out.loc[out['count'] >= threshold].sort_values(by=['value'], ascending=True)
    print(f"Meets threshold:\n{pass_threshold}")

    # largest_n = out.sort_values(by=["count"])[-10:]
    # result = largest_n.sort_values(by=["value"], ascending=False).to_string(index=False)
    # print(largest_n.sort_values(by=['count'], ascending=False))
    


---
	samples/sample.doc
---
Total values:	193444
Threshold:	967.22
Meets threshold:
     value  count
249      3   1061
245      7    996
237     15   1024
220     32   1489
218     34   1332
205     47   1005
194     58   1203
192     60   1079
191     61   1045
190     62   1095
155     97   1396
153     99   1284
151    101   1414
147    105   1423
144    108   1049
143    109   1254
142    110   1298
141    111   1361
140    112    985
138    114   1439
137    115   1473
136    116   1350
133    119   1091
53     199    982

---
	samples/sample.pdf
---
Total values:	1170279
Threshold:	5851.395
Meets threshold:
     value  count
237     15   5856
236     16   5871
221     31   6340
220     32   6831
189     63   7213
125    127   6515
12     240   6027
4      248   6398
0      252   7276

---
	samples/sample.jpeg
---
Total values:	7018
Threshold:	35.09
Meets threshold:
     value  count
250      2     63
249      3     52
248      4     41
244      8     47
243      9     36
236   