We have now computed, for all random forests and all frequency thresholds, 

- the frequent patterns ( Initial Rooted Frequent Subtree Mining (without embedding computation).ipynb )
- all embeddings for each frequent pattern up to size 6 ( Find All Occurrences of All Frequent Patterns of Size up to 6.ipynb )

Thus, we have many files that store the random forests together with embedding information for the patterns, and further files that contain the patterns themselves.

By loading a pair of files, e.g.,

    forests/rootedFrequentTrees/adult/WithLeafEdges/leq6/ET_10_t16_allEmbeddings.json
    forests/rootedFrequentTrees/adult/WithLeafEdges/leq6/ET_10_t16.json
    
we have all the information necessary to see how much we gain by replacing all subtrees in a RF that correspond to a pattern with a function.

In [9]:
import json
from collections import Counter


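def embeddingStatsRec(transaction, weightfunction, counter):
    '''Recursively visit the subtree rooted at transaction and add the weight of
    each embedding stored at a vertex to counter[patternid].'''
    # inner vertices carry a 'feature'; only they can have children
    if 'feature' in transaction:
        if 'leftChild' in transaction:
            embeddingStatsRec(transaction['leftChild'], weightfunction, counter)
        if 'rightChild' in transaction:
            embeddingStatsRec(transaction['rightChild'], weightfunction, counter)

    # pattern[0] is the pattern id; the embeddings listed here are rooted at this vertex
    for pattern in transaction['patterns']:
        counter[pattern[0]] += weightfunction(pattern, transaction)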

    

def embeddingStats(transactions, weightfunction):
    '''Apply weightfunction to all embeddings of each pattern in each transaction and return, for each pattern,
    the sum of gains.
    
    Due to implementation, weightfunction must 
    - take a pattern and a transaction vertex (the root of the embedding being currently weighted) as input and 
    - output an int.
    '''
    cnt = Counter()
    for transaction in transactions:
        embeddingStatsRec(transaction, weightfunction, cnt)
    return cnt
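
To make the expected input shape concrete, here is a minimal sketch on a hand-built toy forest. The data layout is an assumption read off the code above: every vertex carries a 'patterns' list whose entries are [patternid, embedding], with the embedding listing the transaction vertices covered by the pattern. `toy_forest` and `toy_stats` are hypothetical names, and the traversal is restated only so that the block runs on its own.

```python
from collections import Counter

# Hypothetical toy forest with a single tree; the real files may carry more keys.
# Assumption: each 'patterns' entry is [patternid, embedding].
toy_forest = [
    {
        'id': 0, 'feature': 9,
        'patterns': [[193, [0, 1, 2]], [10, [0]]],
        'leftChild': {'id': 1, 'prediction': [], 'patterns': [[10, [1]]]},
        'rightChild': {'id': 2, 'prediction': [], 'patterns': [[10, [2]]]},
    }
]


def toy_stats(forest, weightfunction):
    """Same traversal as embeddingStats, restated so this block is self-contained."""
    counter = Counter()

    def visit(vertex):
        # only inner vertices (those with a 'feature') can have children
        if 'feature' in vertex:
            for child in ('leftChild', 'rightChild'):
                if child in vertex:
                    visit(vertex[child])
        for pattern in vertex['patterns']:
            counter[pattern[0]] += weightfunction(pattern, vertex)

    for tree in forest:
        visit(tree)
    return counter


# each embedding counts for 1
toy_counts = toy_stats(toy_forest, lambda p, t: 1)
# each embedding counts for the number of vertices saved by contracting it
toy_savings = toy_stats(toy_forest, lambda p, t: len(p[1]) - 1)

print(toy_counts.most_common())   # [(10, 3), (193, 1)]
print(toy_savings.most_common())  # [(193, 2), (10, 0)]
```

The single-vertex pattern 10 has the most embeddings but saves nothing, while the three-vertex pattern 193 saves two vertices with a single embedding.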
        

In [2]:
# load some patterns and a random forest with all their embeddings

with open('/home/pascal/Documents/Uni_synced/random_forests/forests/rootedFrequentTrees/adult/WithLeafEdges/leq6/ET_10_t16_allEmbeddings.json') as embeddingFile:
    embeddingInfo = json.load(embeddingFile)

with open('/home/pascal/Documents/Uni_synced/random_forests/forests/rootedFrequentTrees/adult/WithLeafEdges/leq6/ET_10_t16.json') as patternFile:
    patterns = json.load(patternFile)

In [None]:
# transform the list of patterns to a dict for easy access

patternDict = { pattern['patternid'] : pattern['pattern'] for pattern in patterns}

In [12]:
# among the patterns that occur in at least 16 of the trees in ET_10,
# find those with the most embeddings in ET_10; that is, each embedding counts for 1

embeddingCounts = embeddingStats(embeddingInfo, weightfunction=lambda p,t : 1)
embeddingCounts.most_common(10)

[(10, 8177),
 (3, 566),
 (1, 548),
 (52, 521),
 (22, 508),
 (155, 489),
 (75, 373),
 (6, 350),
 (157, 339),
 (156, 332)]

In [14]:
# we can access the corresponding patterns like this

patternDict[10]


{'id': 0, 'prediction': []}

As expected, the pattern that occurs most frequently is very small: it consists of a single vertex.

This is of course not a useful pattern to replace with a function call.

Let's try a different weight function that tells us how many vertices we could save in the RF by contracting each embedding of a pattern into a single vertex:

In [17]:
embeddingSavings = embeddingStats(embeddingInfo, weightfunction=lambda p,t : len(p[1]) - 1)
embeddingSavings.most_common(10)

[(193, 552),
 (160, 532),
 (155, 489),
 (174, 450),
 (161, 434),
 (75, 373),
 (157, 339),
 (156, 332),
 (73, 332),
 (120, 311)]

In [30]:
['{0}: w={1}, p={2}'.format(pid, w, patternDict[pid]) for pid, w in embeddingSavings.most_common(10)]

["193: w=552, p={'id': 0, 'feature': 9, 'leftChild': {'id': 1, 'prediction': []}, 'rightChild': {'id': 2, 'prediction': []}}",
 "160: w=532, p={'id': 0, 'feature': 0, 'leftChild': {'id': 1, 'prediction': []}, 'rightChild': {'id': 2, 'prediction': []}}",
 "155: w=489, p={'id': 0, 'feature': 61, 'rightChild': {'id': 1, 'prediction': []}}",
 "174: w=450, p={'id': 0, 'feature': 63, 'leftChild': {'id': 1, 'prediction': []}, 'rightChild': {'id': 2, 'prediction': []}}",
 "161: w=434, p={'id': 0, 'feature': 61, 'leftChild': {'id': 1, 'prediction': []}, 'rightChild': {'id': 2, 'prediction': []}}",
 "75: w=373, p={'id': 0, 'feature': 9, 'rightChild': {'id': 1, 'prediction': []}}",
 "157: w=339, p={'id': 0, 'feature': 0, 'rightChild': {'id': 1, 'prediction': []}}",
 "156: w=332, p={'id': 0, 'feature': 0, 'leftChild': {'id': 1, 'prediction': []}}",
 "73: w=332, p={'id': 0, 'feature': 9, 'leftChild': {'id': 1, 'prediction': []}}",
 "120: w=311, p={'id': 0, 'feature': 63, 'leftChild': {'id': 1, 'pre

In [19]:
patternDict[193]

{'feature': 9,
 'id': 0,
 'leftChild': {'id': 1, 'prediction': []},
 'rightChild': {'id': 2, 'prediction': []}}
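
As a sanity check on the savings weight, the sketch below counts the vertices of a pattern directly from its nested-dict form. It assumes that `len(p[1])` equals the number of pattern vertices, which is consistent with the outputs above (the two-vertex patterns 155 and 75 have the same value in both rankings). `pattern_size` is a hypothetical helper, not part of the notebook.

```python
def pattern_size(pattern):
    """Number of vertices of a pattern stored as nested dicts with optional children."""
    return 1 + sum(pattern_size(pattern[child])
                   for child in ('leftChild', 'rightChild')
                   if child in pattern)


# pattern 193 as printed above
pattern_193 = {'feature': 9,
               'id': 0,
               'leftChild': {'id': 1, 'prediction': []},
               'rightChild': {'id': 2, 'prediction': []}}

size_193 = pattern_size(pattern_193)
# every embedding saves size - 1 vertices, so the total weight 552
# corresponds to 552 / (3 - 1) = 276 embeddings (under the assumption above)
embeddings_193 = 552 // (size_193 - 1)
print(size_193, embeddings_193)  # 3 276
```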

## Questions:

- What are useful weighting functions for our actual scenario?
- How do we deal with overlaps of the embeddings?
  - Is it more useful to process patterns that are higher up in the DT first? (Presumably yes.)
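
One possible way to explore the overlap question is a greedy top-down pass that accepts an embedding only if none of its vertices has been contracted already. This is only a sketch under stated assumptions, not a settled method: it assumes that `p[1]` lists the ids of the covered transaction vertices and that these ids are unique within a tree. `greedy_savings` and `toy_tree` are hypothetical names.

```python
from collections import Counter


def greedy_savings(vertex, counter, covered=None):
    """Pre-order walk; an embedding is accepted only if it is disjoint from all accepted ones."""
    if covered is None:
        covered = set()
    # at each root, try larger embeddings first
    for p in sorted(vertex.get('patterns', []), key=lambda p: len(p[1]), reverse=True):
        if len(p[1]) > 1 and covered.isdisjoint(p[1]):
            covered.update(p[1])
            counter[p[0]] += len(p[1]) - 1
    for child in ('leftChild', 'rightChild'):
        if child in vertex:
            greedy_savings(vertex[child], counter, covered)
    return counter


# Hypothetical toy tree: pattern 193 at the root overlaps pattern 160 at vertex 1.
toy_tree = {
    'id': 0, 'feature': 9,
    'patterns': [[193, [0, 1, 2]], [75, [0, 2]]],
    'leftChild': {
        'id': 1, 'feature': 0,
        'patterns': [[160, [1, 3, 4]]],
        'leftChild': {'id': 3, 'prediction': [], 'patterns': []},
        'rightChild': {'id': 4, 'prediction': [], 'patterns': []},
    },
    'rightChild': {'id': 2, 'prediction': [], 'patterns': []},
}

result = greedy_savings(toy_tree, Counter())
print(result)  # Counter({193: 2})
```

On this toy tree the greedy pass saves 2 vertices, whereas choosing the embeddings of patterns 75 and 160 instead would save 3. The processing order is therefore worth evaluating on the real forests rather than taking top-down for granted.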