We have now computed, for all random forests and all frequency thresholds:

- the frequent patterns ( Initial Rooted Frequent Subtree Mining (without embedding computation).ipynb )
- all embeddings for each frequent pattern up to size 6 ( Find All Occurrences of All Frequent Patterns of Size up to 6.ipynb )

Thus, we have many files: some store the random forests together with embedding information for the patterns, and others contain the pattern information.

By loading a pair of files, e.g.,

    forests/rootedFrequentTrees/adult/WithLeafEdges/leq6/RF_10_t16_allEmbeddings.json
    forests/rootedFrequentTrees/adult/WithLeafEdges/leq6/RF_10_t16.json
    
we have all the information necessary to see how much it helps to replace all subtrees in an RF that correspond to a pattern with a function.

In [1]:
import sys
import SubtreeSelection.PatternWeights as pw

In [2]:
embeddingFile = '/home/pascal/Documents/Uni_synced/random_forests/forests/rootedFrequentTrees/adult/WithLeafEdges/leq6/RF_10_t16_allEmbeddings.json'
patternFile = '/home/pascal/Documents/Uni_synced/random_forests/forests/rootedFrequentTrees/adult/WithLeafEdges/leq6/RF_10_t16.json'

dataSupportStatistics = pw.PatternStatisticsFromPrecomputed(patternFile, embeddingFile, 'data_support')
singleNodeCompressionStatistics = pw.PatternStatisticsFromPrecomputed(patternFile, embeddingFile, 'single_node_compression')
frequencyStatistics = pw.PatternStatisticsFromPrecomputed(patternFile, embeddingFile, 'frequency')

In [3]:
frequencyStatistics.most_common_pattern_ids()

[(14, 50),
 (18, 6),
 (127, 6),
 (125, 6),
 (173, 6),
 (25, 4),
 (111, 4),
 (28, 4),
 (6, 3),
 (149, 3)]

In [4]:
frequencyStatistics.most_common_patterns_string()

["14: w=50, p={'id': 0, 'prediction': []}",
 "18: w=6, p={'id': 0, 'feature': 0}",
 "127: w=6, p={'id': 0, 'feature': 0, 'leftChild': {'id': 1, 'prediction': []}}",
 "125: w=6, p={'id': 0, 'feature': 0, 'rightChild': {'id': 1, 'prediction': []}}",
 "173: w=6, p={'id': 0, 'feature': 0, 'leftChild': {'id': 1, 'prediction': []}, 'rightChild': {'id': 2, 'prediction': []}}",
 "25: w=4, p={'id': 0, 'feature': 27}",
 "111: w=4, p={'id': 0, 'feature': 27, 'leftChild': {'id': 1, 'prediction': []}}",
 "28: w=4, p={'id': 0, 'feature': 50}",
 "6: w=3, p={'id': 0, 'feature': 29}",
 "149: w=3, p={'id': 0, 'feature': 29, 'rightChild': {'id': 1, 'prediction': []}}"]

As we can see (and as was to be expected), the most frequent pattern is very small: it consists of a single vertex.

Of course, this is not a useful pattern to replace with a function call.
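
One way to discard such trivial patterns is to count the vertices of each pattern and keep only those with at least two. The following is a minimal sketch, assuming the string format printed by `most_common_patterns_string()` above (`id: w=weight, p=pattern`). The helpers `parse_entry` and `pattern_size` are hypothetical and not part of `SubtreeSelection`.

```python
import ast

def parse_entry(entry):
    """Split 'id: w=weight, p={...}' into (id, weight, pattern dict)."""
    head, pattern_str = entry.split(", p=", 1)
    pattern_id, weight = head.split(": w=", 1)
    return int(pattern_id), int(weight), ast.literal_eval(pattern_str)

def pattern_size(pattern):
    """Count the vertices of a pattern given as nested dicts."""
    size = 1
    for key in ("leftChild", "rightChild"):
        if key in pattern:
            size += pattern_size(pattern[key])
    return size

# Three entries copied from the output of In [4].
entries = [
    "14: w=50, p={'id': 0, 'prediction': []}",
    "127: w=6, p={'id': 0, 'feature': 0, 'leftChild': {'id': 1, 'prediction': []}}",
    "173: w=6, p={'id': 0, 'feature': 0, 'leftChild': {'id': 1, 'prediction': []}, "
    "'rightChild': {'id': 2, 'prediction': []}}",
]

# Keep (id, weight, size) for every pattern with more than one vertex.
nontrivial = [(pid, w, pattern_size(p))
              for pid, w, p in map(parse_entry, entries)
              if pattern_size(p) > 1]
print(nontrivial)
```

With these three entries, the single-vertex pattern 14 is dropped and patterns 127 and 173 remain.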

Let's try a different weight function that tells us how many vertices we could save in the RF by contracting each embedding of a pattern into a single vertex:

In [5]:
singleNodeCompressionStatistics.most_common_pattern_ids()

[(173, 6),
 (176, 1),
 (159, 1),
 (182, 1),
 (191, 1),
 (167, 1),
 (18, 0),
 (127, 0),
 (125, 0),
 (14, 0)]

## Questions:

- What are useful weighting functions for our actual scenario?
- How do we deal with overlaps of the embeddings?
  - Is it more useful to process patterns that are higher up in the DT first? (I guess so.)
- For some strange reason, the embedding file above is not correct.
  - In contrast, the pattern file works correctly when the statistics are computed from scratch, see below.
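
One possible way to approach the overlap question is a greedy selection that prefers embeddings whose root is higher up in the tree and skips any embedding sharing a vertex with one already chosen. The following is only a sketch under assumptions: the `Embedding` tuple (pattern id, tree id, depth of the embedding's root, covered vertex ids) is a hypothetical representation, not the format of the embedding files, and the toy data is made up for illustration.

```python
from collections import namedtuple

# Hypothetical representation: one embedding = the set of vertices it covers in one tree.
Embedding = namedtuple("Embedding", ["pattern_id", "tree_id", "root_depth", "vertices"])

def select_disjoint(embeddings):
    """Greedily keep embeddings that share no vertex within a tree.

    Embeddings rooted closer to the tree root come first;
    ties are broken in favour of larger embeddings.
    """
    used = {}    # tree_id -> set of vertices already covered
    chosen = []
    for emb in sorted(embeddings, key=lambda e: (e.root_depth, -len(e.vertices))):
        covered = used.setdefault(emb.tree_id, set())
        if covered.isdisjoint(emb.vertices):
            chosen.append(emb)
            covered.update(emb.vertices)
    return chosen

# Made-up toy data, for illustration only.
embeddings = [
    Embedding(173, 0, 1, frozenset({1, 3, 4})),
    Embedding(127, 0, 1, frozenset({1, 3})),    # overlaps the embedding of 173 above
    Embedding(125, 0, 2, frozenset({5, 6})),
    Embedding(173, 1, 0, frozenset({0, 1, 2})),
]

chosen = select_disjoint(embeddings)
# Contracting an embedding with k vertices into a single vertex removes k - 1 vertices.
saved = sum(len(e.vertices) - 1 for e in chosen)
print([(e.pattern_id, e.tree_id) for e in chosen], saved)
```

Greedy selection is not guaranteed to maximize the number of saved vertices; it is merely a cheap baseline against which other orderings (e.g. by weight) could be compared.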

In [6]:
transactionFile = '/home/pascal/Documents/Uni_synced/random_forests/forests/adult/text/RF_10.json'
patternFile = '/home/pascal/Documents/Uni_synced/random_forests/forests/rootedFrequentTrees/adult/WithLeafEdges/leq6/RF_10_t16.json'

dataSupportStatisticsSkratch = pw.PatternStatisticsFromSkratch(patternFile, transactionFile, 'data_support')
singleNodeCompressionStatisticsSkratch = pw.PatternStatisticsFromSkratch(patternFile, transactionFile, 'single_node_compression')
frequencyStatisticsSkratch = pw.PatternStatisticsFromSkratch(patternFile, transactionFile, 'frequency')

In [7]:
dataSupportStatisticsSkratch.most_common_patterns_string()

["152: w=70425, p={'id': 0, 'feature': 61, 'rightChild': {'id': 1, 'prediction': []}}",
 "141: w=47721, p={'id': 0, 'feature': 63, 'leftChild': {'id': 1, 'prediction': []}}",
 "127: w=39455, p={'id': 0, 'feature': 0, 'leftChild': {'id': 1, 'prediction': []}}",
 "143: w=36641, p={'id': 0, 'feature': 63, 'rightChild': {'id': 1, 'prediction': []}}",
 "114: w=36292, p={'id': 0, 'feature': 62, 'rightChild': {'id': 1, 'prediction': []}}",
 "164: w=34024, p={'id': 0, 'feature': 63, 'leftChild': {'id': 1, 'prediction': []}, 'rightChild': {'id': 2, 'prediction': []}}",
 "139: w=33990, p={'id': 0, 'feature': 63, 'rightChild': {'id': 1, 'feature': 0}}",
 "125: w=31140, p={'id': 0, 'feature': 0, 'rightChild': {'id': 1, 'prediction': []}}",
 "136: w=28909, p={'id': 0, 'feature': 26, 'leftChild': {'id': 1, 'prediction': []}}",
 "134: w=28038, p={'id': 0, 'feature': 26, 'rightChild': {'id': 1, 'prediction': []}}"]

In [8]:
singleNodeCompressionStatisticsSkratch.most_common_patterns_string()

["181: w=310, p={'id': 0, 'feature': 9, 'leftChild': {'id': 1, 'prediction': []}, 'rightChild': {'id': 2, 'prediction': []}}",
 "173: w=301, p={'id': 0, 'feature': 0, 'leftChild': {'id': 1, 'prediction': []}, 'rightChild': {'id': 2, 'prediction': []}}",
 "164: w=266, p={'id': 0, 'feature': 63, 'leftChild': {'id': 1, 'prediction': []}, 'rightChild': {'id': 2, 'prediction': []}}",
 "159: w=156, p={'id': 0, 'feature': 61, 'leftChild': {'id': 1, 'prediction': []}, 'rightChild': {'id': 2, 'prediction': []}}",
 "167: w=117, p={'id': 0, 'feature': 26, 'leftChild': {'id': 1, 'prediction': []}, 'rightChild': {'id': 2, 'prediction': []}}",
 "178: w=90, p={'id': 0, 'feature': 62, 'leftChild': {'id': 1, 'prediction': []}, 'rightChild': {'id': 2, 'prediction': []}}",
 "185: w=89, p={'id': 0, 'feature': 1, 'leftChild': {'id': 1, 'prediction': []}, 'rightChild': {'id': 2, 'prediction': []}}",
 "175: w=53, p={'id': 0, 'feature': 54, 'leftChild': {'id': 1, 'prediction': []}, 'rightChild': {'id': 2, 'pr

In [9]:
frequencyStatisticsSkratch.most_common_patterns_string()

["14: w=7405, p={'id': 0, 'prediction': []}",
 "18: w=699, p={'id': 0, 'feature': 0}",
 "30: w=622, p={'id': 0, 'feature': 9}",
 "10: w=606, p={'id': 0, 'feature': 63}",
 "5: w=473, p={'id': 0, 'feature': 61}",
 "104: w=399, p={'id': 0, 'feature': 9, 'leftChild': {'id': 1, 'prediction': []}}",
 "127: w=387, p={'id': 0, 'feature': 0, 'leftChild': {'id': 1, 'prediction': []}}",
 "102: w=382, p={'id': 0, 'feature': 9, 'rightChild': {'id': 1, 'prediction': []}}",
 "12: w=377, p={'id': 0, 'feature': 26}",
 "125: w=370, p={'id': 0, 'feature': 0, 'rightChild': {'id': 1, 'prediction': []}}"]