# Finding the Embeddings of the Frequent Patterns

It should be easy (and fast enough) to find all embeddings of a few frequent patterns in the random forest database.

To do this, we need to:

- [x] select the right patterns (e.g. those of size six)
- [x] read my canonical string format in python / transform it to json
- [x] read the json database format in python
- [ ] write a small script that traverses the random forests and finds all embeddings and outputs them somehow

## Select the Right Patterns (e.g. those of size six)


In [2]:
# pattern size in number of vertices
patternSize = 4
# the filter below counts edges (one '(' per edge); a tree with k vertices has k - 1 edges
patternSize = patternSize - 1 

filename = 'forests/rootedFrequentTrees/adult/WithLeafEdges/RF_10_t10.patterns'
f = open(filename)

# gives us the patterns of the selected size
frequentPatterns = filter(lambda line: line.count('(') == patternSize, f)

# gives us only the canonical strings of the patterns
cStrings = map(lambda fp: fp.split('\t')[2], frequentPatterns)

cStringList = list(cStrings)
print(cStringList)

f.close()

['61 ( leftChild 63 ( leftChild leaf ) ) ( rightChild leaf ) \n', '63 ( leftChild 0 ( leftChild leaf ) ( rightChild leaf ) ) \n', '63 ( leftChild 0 ( leftChild leaf ) ) ( rightChild leaf ) \n', '63 ( rightChild 9 ( leftChild leaf ) ( rightChild leaf ) ) \n', '63 ( rightChild 0 ( leftChild leaf ) ( rightChild leaf ) ) \n', '26 ( rightChild 9 ( leftChild leaf ) ( rightChild leaf ) ) \n', '0 ( leftChild 9 ( leftChild leaf ) ( rightChild leaf ) ) \n', '0 ( leftChild 9 ( leftChild leaf ) ) ( rightChild leaf ) \n', '0 ( rightChild 0 ( leftChild leaf ) ( rightChild leaf ) ) \n', '0 ( rightChild 63 ( leftChild leaf ) ( rightChild leaf ) ) \n', '0 ( rightChild 9 ( leftChild leaf ) ( rightChild leaf ) ) \n', '9 ( leftChild leaf ) ( rightChild 0 ( rightChild leaf ) ) \n', '9 ( rightChild 0 ( leftChild leaf ) ( rightChild leaf ) ) \n', '9 ( rightChild 9 ( leftChild leaf ) ( rightChild leaf ) ) \n']
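
The filter works because every edge of a pattern contributes exactly one opening parenthesis to its canonical string, so a pattern with $k$ vertices contains $k - 1$ occurrences of `(`. Here is a minimal sketch of the same selection on in-memory data. The two sample lines are made up for illustration, and the tab-separated layout `count<TAB>id<TAB>canonical string` is an assumption inferred from how the cells in this notebook index the fields; `numberOfVertices` and `selectPatterns` are hypothetical helper names.

```python
import io

# Two made-up lines in the assumed layout: count <TAB> pattern id <TAB> canonical string
sample = io.StringIO(
    "12\t7\t0 ( leftChild leaf ) \n"
    "10\t9\t63 ( leftChild 0 ( leftChild leaf ) ( rightChild leaf ) ) \n"
)


def numberOfVertices(cString):
    # every edge contributes exactly one '(' and a tree has one vertex more than edges
    return cString.count('(') + 1


def selectPatterns(lines, patternSize):
    '''Return the canonical strings of all patterns with exactly patternSize vertices.'''
    selected = []
    for line in lines:
        cString = line.split('\t')[2]
        if numberOfVertices(cString) == patternSize:
            selected.append(cString.strip())
    return selected


fourVertexPatterns = selectPatterns(sample, 4)
print(fourVertexPatterns)
```

Only the second sample line survives the filter, since its canonical string has three opening parentheses.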


In [None]:
import json

## Transform My Canonical String Format to JSON

In [3]:
s = '61 ( leftChild 63 ( leftChild leaf ) ) ( rightChild leaf ) \n'



def cString2json(cString):
    '''Pascal's canonical string format and the json format used in Dortmund are 
    basically identical (up to ordering, symbols, and general feeling of course ;) ).
    This is a converter that transforms a single tree from cString format to json format 
    (entirely by string manipulation).'''
    
    intermediate = cString.replace('( leftChild', ',"leftChild":{').replace('( rightChild', ',"rightChild":{').replace(')', '}').replace('leaf', '-1 "prediction":[]')
    tokens = intermediate.split(' ')
    
    json = ''
    i = 0
    for t in tokens:
        try:
            feature = int(t)
            if feature != -1:
                s = '"id":' + str(i) + ',"feature":' + t
            else:
                s = '"id":' + str(i) + ','
            json += s
            i += 1
        except ValueError:
            json += t
            
            
    return ('{' + json.rstrip() + '}')
    
cString2json(s)

'{"id":0,"feature":61,"leftChild":{"id":1,"feature":63,"leftChild":{"id":2,"prediction":[]}},"rightChild":{"id":3,"prediction":[]}}'
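
Since the conversion works purely on strings, it can be worth checking that the result really parses and has the expected shape. The sketch below parses the output shown above with `json.loads` and walks the tree; `collectVertices` is a hypothetical helper, not part of the notebook.

```python
import json

# output of cString2json for the four-vertex example pattern above
converted = ('{"id":0,"feature":61,"leftChild":{"id":1,"feature":63,'
             '"leftChild":{"id":2,"prediction":[]}},'
             '"rightChild":{"id":3,"prediction":[]}}')


def collectVertices(tree, found=None):
    '''Return (id, kind) pairs in preorder, where kind is "split" or "leaf".'''
    if found is None:
        found = []
    kind = 'split' if 'feature' in tree else 'leaf'
    found.append((tree['id'], kind))
    for child in ('leftChild', 'rightChild'):
        if child in tree:
            collectVertices(tree[child], found)
    return found


vertices = collectVertices(json.loads(converted))
print(vertices)
```

Because the converter hands out ids in the order in which vertices appear in the canonical string, the ids come out as consecutive preorder numbers.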

In [4]:
jsons = '[' + ',\n'.join(map(cString2json, cStringList)) + ']'
print(jsons)

[{"id":0,"feature":61,"leftChild":{"id":1,"feature":63,"leftChild":{"id":2,"prediction":[]}},"rightChild":{"id":3,"prediction":[]}},
{"id":0,"feature":63,"leftChild":{"id":1,"feature":0,"leftChild":{"id":2,"prediction":[]},"rightChild":{"id":3,"prediction":[]}}},
{"id":0,"feature":63,"leftChild":{"id":1,"feature":0,"leftChild":{"id":2,"prediction":[]}},"rightChild":{"id":3,"prediction":[]}},
{"id":0,"feature":63,"rightChild":{"id":1,"feature":9,"leftChild":{"id":2,"prediction":[]},"rightChild":{"id":3,"prediction":[]}}},
{"id":0,"feature":63,"rightChild":{"id":1,"feature":0,"leftChild":{"id":2,"prediction":[]},"rightChild":{"id":3,"prediction":[]}}},
{"id":0,"feature":26,"rightChild":{"id":1,"feature":9,"leftChild":{"id":2,"prediction":[]},"rightChild":{"id":3,"prediction":[]}}},
{"id":0,"feature":0,"leftChild":{"id":1,"feature":9,"leftChild":{"id":2,"prediction":[]},"rightChild":{"id":3,"prediction":[]}}},
{"id":0,"feature":0,"leftChild":{"id":1,"feature":9,"leftChild":{"id":2,"predic

In [19]:
def parseCStringFile(fileName, patternSize):
	# here, we count the number of edges in the pattern
	patternSize = patternSize - 1 

	f = open(fileName)

	# gives us the patterns of the selected size
	frequentPatterns = filter(lambda line: line.count('(') == patternSize, f)

	# splits the strings into fields
	tokens = map(lambda fp: fp.split('\t'), frequentPatterns)

	# gives us only the canonical strings of the patterns and their id
	pairs = map(lambda t: (t[1], t[2]), tokens)

	# transform to json strings
	jsonCStrings = map(lambda pair: '{"patternid":' + pair[0] + ',"pattern":' + cString2json(pair[1]) + '}', pairs)

	jsonBlob = '[' + ',\n'.join(jsonCStrings) + ']'

	f.close()

	return jsonBlob

json.loads(parseCStringFile('forests/rootedFrequentTrees/adult/WithLeafEdges/RF_10_t10.patterns', 4))

[{'pattern': {'feature': 61,
   'id': 0,
   'leftChild': {'feature': 63,
    'id': 1,
    'leftChild': {'id': 2, 'prediction': []}},
   'rightChild': {'id': 3, 'prediction': []}},
  'patternid': 339},
 {'pattern': {'feature': 63,
   'id': 0,
   'leftChild': {'feature': 0,
    'id': 1,
    'leftChild': {'id': 2, 'prediction': []},
    'rightChild': {'id': 3, 'prediction': []}}},
  'patternid': 337},
 {'pattern': {'feature': 63,
   'id': 0,
   'leftChild': {'feature': 0,
    'id': 1,
    'leftChild': {'id': 2, 'prediction': []}},
   'rightChild': {'id': 3, 'prediction': []}},
  'patternid': 336},
 {'pattern': {'feature': 63,
   'id': 0,
   'rightChild': {'feature': 9,
    'id': 1,
    'leftChild': {'id': 2, 'prediction': []},
    'rightChild': {'id': 3, 'prediction': []}}},
  'patternid': 338},
 {'pattern': {'feature': 63,
   'id': 0,
   'rightChild': {'feature': 0,
    'id': 1,
    'leftChild': {'id': 2, 'prediction': []},
    'rightChild': {'id': 3, 'prediction': []}}},
  'patternid': 

## Write a Small Script That Traverses the Random Forests and Finds All Embeddings and Outputs Them Somehow

In [5]:
import json
import sys

j = json.loads(jsons)
print(json.dumps(j, sort_keys=True, indent=4))

[
    {
        "feature": 61,
        "id": 0,
        "leftChild": {
            "feature": 63,
            "id": 1,
            "leftChild": {
                "id": 2,
                "prediction": []
            }
        },
        "rightChild": {
            "id": 3,
            "prediction": []
        }
    },
    {
        "feature": 63,
        "id": 0,
        "leftChild": {
            "feature": 0,
            "id": 1,
            "leftChild": {
                "id": 2,
                "prediction": []
            },
            "rightChild": {
                "id": 3,
                "prediction": []
            }
        }
    },
    {
        "feature": 63,
        "id": 0,
        "leftChild": {
            "feature": 0,
            "id": 1,
            "leftChild": {
                "id": 2,
                "prediction": []
            }
        },
        "rightChild": {
            "id": 3,
            "prediction": []
        }
    },
    {
        "feature": 63,
 

There must be a smarter way than the following. 
However, this is rather easy to implement:
We iterate (recursively) over the vertices $v$ of the transaction (decision) tree and check whether there is a rooted subgraph isomorphism from the pattern to the transaction that maps the root of the pattern to $v$.
This is decided by recursing (again) over pattern and transaction simultaneously, for as long as the vertices match.

We'll see whether this is fast enough for our case. It's something along the lines of $O(n \cdot p)$, where $n$ and $p$ are the numbers of vertices of the transaction and the pattern, respectively.
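
To make the bound of roughly $n \cdot p$ vertex comparisons tangible, here is a small self-contained sketch that counts comparisons on a toy tree. It is a simplified variant and not the notebook's implementation: it stops at the first mismatching child and does not record the mapping. The helper names (`size`, `matches`, `countRootedMatches`) and the toy trees are made up for illustration.

```python
def size(tree):
    '''Number of vertices of a tree in the json format used here.'''
    return 1 + sum(size(tree[c]) for c in ('leftChild', 'rightChild') if c in tree)


def matches(pattern, transaction, counter):
    '''Rooted matching of pattern at transaction; counter[0] counts vertex comparisons.'''
    counter[0] += 1
    if 'prediction' in pattern and 'prediction' in transaction:
        return True
    if 'feature' in pattern and 'feature' in transaction:
        if pattern['feature'] != transaction['feature']:
            return False
        for c in ('leftChild', 'rightChild'):
            if c in pattern:
                if c not in transaction or not matches(pattern[c], transaction[c], counter):
                    return False
        return True
    # mixed case: split vertex vs. leaf vertex
    return False


def countRootedMatches(pattern, transaction, counter):
    '''Try every transaction vertex as the image of the pattern root.'''
    found = 1 if matches(pattern, transaction, counter) else 0
    for c in ('leftChild', 'rightChild'):
        if c in transaction:
            found += countRootedMatches(pattern, transaction[c], counter)
    return found


def leaf(i):
    return {'id': i, 'prediction': []}


# toy data: the pattern occurs once, rooted at transaction vertex 1
transaction = {'id': 0, 'feature': 5,
               'leftChild': {'id': 1, 'feature': 5,
                             'leftChild': leaf(2), 'rightChild': leaf(3)},
               'rightChild': leaf(4)}
pattern = {'id': 0, 'feature': 5, 'leftChild': leaf(1), 'rightChild': leaf(2)}

counter = [0]
found = countRootedMatches(pattern, transaction, counter)
print(found, counter[0], size(transaction) * size(pattern))
```

Each root attempt touches every pattern vertex at most once, and there are $n$ root attempts, so the number of comparisons never exceeds $n \cdot p$.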

In [6]:
def printInfo(pattern, transaction, where):
    print(where)
    print('p: ' + str(pattern))
    print('t: ' + str(transaction))


def recSearch(pattern, transaction, mapping):
    # check if we are in a leaf vertex in both pattern and transaction
    if 'prediction' in pattern.keys() and 'prediction' in transaction.keys():
        mapping[pattern['id']] = transaction['id']
        return True
    
    # check if we are in a split vertex in both pattern and transaction
    if 'feature' in pattern.keys() and 'feature' in transaction.keys():

        # check if split features match
        if pattern['feature'] == transaction['feature']:
            
            foundLeft = True
            foundRight = True
            if 'leftChild' in pattern.keys():
                if 'leftChild' in transaction.keys():
                    foundLeft = recSearch(pattern['leftChild'], transaction['leftChild'], mapping)    
                else:
                    foundLeft = False
                
            if 'rightChild' in pattern.keys():
                if 'rightChild' in transaction.keys():
                    foundRight = recSearch(pattern['rightChild'], transaction['rightChild'], mapping)
                else:
                    foundRight = False
                
            if foundLeft and foundRight:
                mapping[pattern['id']] = transaction['id']
                return True
            else:
                return False
            
    # if we are in the mixed case split vertex vs. leaf vertex then we cannot map the vertices on each other
    return False


def searchEmbedding(pattern, transaction):
    '''For two given root vertices, check whether pattern is a rooted 
    subtree and return a mapping id->id if so, o/w None'''
    
    mapping = dict()
    if recSearch(pattern, transaction, mapping):
        return mapping
    else:
        return None
    
    
def allEmbeddingsRec(pattern, transaction, result):
    print('process transaction vertex id ' + str(transaction['id']))
    if 'feature' in transaction.keys():
        if 'leftChild' in transaction.keys():
            allEmbeddingsRec(pattern, transaction['leftChild'], result)
        if 'rightChild' in transaction.keys():
            allEmbeddingsRec(pattern, transaction['rightChild'], result)
    
    result.append((transaction['id'], searchEmbedding(pattern, transaction)))
    return result


def allEmbeddings(pattern, transaction):
    embeddingsAndNone = allEmbeddingsRec(pattern, transaction, list())
    return embeddingsAndNone #list(filter(lambda x: x != None, embeddingsAndNone))


In [7]:
for transaction in j:
    print(allEmbeddings(j[0], transaction))

process transaction vertex id 0
process transaction vertex id 1
process transaction vertex id 2
process transaction vertex id 3
[(2, None), (1, None), (3, None), (0, {2: 2, 1: 1, 3: 3, 0: 0})]
process transaction vertex id 0
process transaction vertex id 1
process transaction vertex id 2
process transaction vertex id 3
[(2, None), (3, None), (1, None), (0, None)]
process transaction vertex id 0
process transaction vertex id 1
process transaction vertex id 2
process transaction vertex id 3
[(2, None), (1, None), (3, None), (0, None)]
process transaction vertex id 0
process transaction vertex id 1
process transaction vertex id 2
process transaction vertex id 3
[(2, None), (3, None), (1, None), (0, None)]
process transaction vertex id 0
process transaction vertex id 1
process transaction vertex id 2
process transaction vertex id 3
[(2, None), (3, None), (1, None), (0, None)]
process transaction vertex id 0
process transaction vertex id 1
process transaction vertex id 2
process transaction

In [8]:
leaf = json.loads('{"id": 1,"prediction": []}')
for transaction in j:
    print(transaction)
    print(allEmbeddings(leaf, transaction))

{'id': 0, 'feature': 61, 'leftChild': {'id': 1, 'feature': 63, 'leftChild': {'id': 2, 'prediction': []}}, 'rightChild': {'id': 3, 'prediction': []}}
process transaction vertex id 0
process transaction vertex id 1
process transaction vertex id 2
process transaction vertex id 3
[(2, {1: 2}), (1, None), (3, {1: 3}), (0, None)]
{'id': 0, 'feature': 63, 'leftChild': {'id': 1, 'feature': 0, 'leftChild': {'id': 2, 'prediction': []}, 'rightChild': {'id': 3, 'prediction': []}}}
process transaction vertex id 0
process transaction vertex id 1
process transaction vertex id 2
process transaction vertex id 3
[(2, {1: 2}), (3, {1: 3}), (1, None), (0, None)]
{'id': 0, 'feature': 63, 'leftChild': {'id': 1, 'feature': 0, 'leftChild': {'id': 2, 'prediction': []}}, 'rightChild': {'id': 3, 'prediction': []}}
process transaction vertex id 0
process transaction vertex id 1
process transaction vertex id 2
process transaction vertex id 3
[(2, {1: 2}), (1, None), (3, {1: 3}), (0, None)]
{'id': 0, 'feature': 63,

## Script, without Debug Info

In [9]:
def printInfo(pattern, transaction, where):
    print(where)
    print('p: ' + str(pattern))
    print('t: ' + str(transaction))


def recSearch(pattern, transaction, mapping):
    # check if we are in a leaf vertex in both pattern and transaction
    if 'prediction' in pattern.keys() and 'prediction' in transaction.keys():
        mapping[pattern['id']] = transaction['id']
        return True
    
    # check if we are in a split vertex in both pattern and transaction
    if 'feature' in pattern.keys() and 'feature' in transaction.keys():

        # check if split features match
        if pattern['feature'] == transaction['feature']:
            
            foundLeft = True
            foundRight = True
            if 'leftChild' in pattern.keys():
                if 'leftChild' in transaction.keys():
                    foundLeft = recSearch(pattern['leftChild'], transaction['leftChild'], mapping)    
                else:
                    foundLeft = False
                
            if 'rightChild' in pattern.keys():
                if 'rightChild' in transaction.keys():
                    foundRight = recSearch(pattern['rightChild'], transaction['rightChild'], mapping)
                else:
                    foundRight = False
                
            if foundLeft and foundRight:
                mapping[pattern['id']] = transaction['id']
                return True
            else:
                return False
            
    # if we are in the mixed case split vertex vs. leaf vertex then we cannot map the vertices on each other
    return False


def searchEmbedding(pattern, transaction):
    '''For two given root vertices, check whether pattern is a rooted 
    subtree and return a mapping id->id if so, o/w None'''
    
    mapping = dict()
    if recSearch(pattern, transaction, mapping):
        return mapping
    else:
        return None
    
    
def allEmbeddingsRec(pattern, transaction, result):
    if 'feature' in transaction.keys():
        if 'leftChild' in transaction.keys():
            allEmbeddingsRec(pattern, transaction['leftChild'], result)
        if 'rightChild' in transaction.keys():
            allEmbeddingsRec(pattern, transaction['rightChild'], result)
    
    result.append(searchEmbedding(pattern, transaction))
    return result


def allEmbeddings(pattern, transaction):
    embeddingsAndNone = allEmbeddingsRec(pattern, transaction, list())
    return list(filter(lambda x: x is not None, embeddingsAndNone))



# def main(file, out):
#     f = open(file)
#     j = json.load(f)
#     f.close()

#     graphCounter = 0
#     for tree in j:
#         vertexLabels, edges = parseTree(tree)
#         transform2GraphDB(vertexLabels, edges, graphCounter, out)
#         graphCounter += 1
#     print('$')

# if __name__ == '__main__':
#     main(sys.argv[1], sys.stdout)

In [10]:
leaf = json.loads('{"id": 1,"prediction": []}')
for transaction in j:
    print(transaction)
    print(allEmbeddings(leaf, transaction))

{'id': 0, 'feature': 61, 'leftChild': {'id': 1, 'feature': 63, 'leftChild': {'id': 2, 'prediction': []}}, 'rightChild': {'id': 3, 'prediction': []}}
[{1: 2}, {1: 3}]
{'id': 0, 'feature': 63, 'leftChild': {'id': 1, 'feature': 0, 'leftChild': {'id': 2, 'prediction': []}, 'rightChild': {'id': 3, 'prediction': []}}}
[{1: 2}, {1: 3}]
{'id': 0, 'feature': 63, 'leftChild': {'id': 1, 'feature': 0, 'leftChild': {'id': 2, 'prediction': []}}, 'rightChild': {'id': 3, 'prediction': []}}
[{1: 2}, {1: 3}]
{'id': 0, 'feature': 63, 'rightChild': {'id': 1, 'feature': 9, 'leftChild': {'id': 2, 'prediction': []}, 'rightChild': {'id': 3, 'prediction': []}}}
[{1: 2}, {1: 3}]
{'id': 0, 'feature': 63, 'rightChild': {'id': 1, 'feature': 0, 'leftChild': {'id': 2, 'prediction': []}, 'rightChild': {'id': 3, 'prediction': []}}}
[{1: 2}, {1: 3}]
{'id': 0, 'feature': 26, 'rightChild': {'id': 1, 'feature': 9, 'leftChild': {'id': 2, 'prediction': []}, 'rightChild': {'id': 3, 'prediction': []}}}
[{1: 2}, {1: 3}]
{'id':

In [11]:
filename = 'forests/adult/text/RF_10.json'
with open(filename, 'r') as f:
    transactions = json.load(f)

patterns = j

embeddingCounts = list()

for pattern in patterns:
    counts = 0
    for transaction in transactions:
        mappings = allEmbeddings(pattern, transaction)
        counts += len(mappings)
    embeddingCounts.append(counts)
print(embeddingCounts)

[15, 17, 10, 18, 15, 12, 20, 10, 14, 12, 14, 13, 16, 23]


In [12]:
print(patterns[0])

{'id': 0, 'feature': 61, 'leftChild': {'id': 1, 'feature': 63, 'leftChild': {'id': 2, 'prediction': []}}, 'rightChild': {'id': 3, 'prediction': []}}


In [13]:
allEmbeddings(patterns[0], transactions[10])

[{0: 205, 1: 206, 2: 207, 3: 211}, {0: 620, 1: 621, 2: 622, 3: 628}]

In [14]:
filename = 'forests/rootedFrequentTrees/adult/WithLeafEdges/RF_10_t10.patterns'
f = open(filename)

# gives us the patterns of the selected size
frequentPatterns = filter(lambda line: line.count('(') == patternSize, f)

# gives us only the counts of the patterns as reported by the mining algorithm (first field)
patternCountsFromAlg = list(map(lambda fp: int(fp.split('\t')[0]), frequentPatterns))

print(patternCountsFromAlg)

f.close()

[13, 14, 10, 15, 11, 11, 15, 10, 10, 10, 11, 10, 10, 13]
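
The two lists can be put side by side. The sketch below copies the numbers printed above and compares them. It assumes that the first field of the pattern file is the frequency reported by the mining algorithm (for example the number of trees that contain the pattern); that interpretation is not confirmed anywhere in this notebook.

```python
# numbers copied from the outputs of the two cells above
embeddingCounts = [15, 17, 10, 18, 15, 12, 20, 10, 14, 12, 14, 13, 16, 23]
patternCountsFromAlg = [13, 14, 10, 15, 11, 11, 15, 10, 10, 10, 11, 10, 10, 13]

# surplus of explicit embeddings over the count stored in the pattern file
surplus = [e - c for e, c in zip(embeddingCounts, patternCountsFromAlg)]
print(surplus)

# average number of embeddings per counted occurrence, over all patterns
ratio = sum(embeddingCounts) / sum(patternCountsFromAlg)
print(round(ratio, 2))
```

Under the stated assumption, a surplus of zero means that every counted occurrence corresponds to exactly one embedding, while a positive surplus means that at least one tree contains the pattern more than once.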


## We Want to Output the Embeddings That Are Found in the Transaction Forest

I have written cString2json.py, which converts my format to a json format that contains, for each pattern, an object with the fields patternid and pattern.

In [21]:
#!/usr/bin/env python3

'''Transform the canonical string format that is given by the lwg and lwgr 
programs to a json format that is compatible to the format of Sebastian.
reads from stdin and prints to stdout

usage: cString2json.py leq|eq patternSize < patternFile > jsonFile

leq results in all patterns up to patternSize vertices being converted,
eq results in all patterns of exactly patternSize vertices being converted.'''


import sys


def cString2json(cString):
	'''Pascal's canonical string format and the json format used in Dortmund are 
	basically identical (up to ordering, symbols, and general feeling of course ;) ).
	This is a converter that transforms a single tree from cString format to json format 
	(entirely by string manipulation).'''
	
	intermediate = cString.replace('( leftChild', ',"leftChild":{').replace('( rightChild', ',"rightChild":{').replace(')', '}').replace('leaf', '-1 "prediction":[]')
	tokens = intermediate.split(' ')
	
	json = ''
	i = 0
	for t in tokens:
		try:
			feature = int(t)
			if feature != -1:
				s = '"id":' + str(i) + ',"feature":' + t
			else:
				s = '"id":' + str(i) + ','
			json += s
			i += 1
		except ValueError:
			json += t
		    
		    
	return ('{' + json.rstrip() + '}')


def parseCStringFileFixedSizePatterns(fIn,  patternSize):
	'''Select the patterns with exactly patternSize vertices from the open 
	file fIn. fIn is assumed to be in the format that lwg or lwgr 
	uses to store the frequent patterns.'''

	# here, we count the number of edges in the pattern
	patternSize = patternSize - 1 

	# gives us the patterns of the selected size
	frequentPatterns = filter(lambda line: line.count('(') == patternSize, fIn)

	# splits the strings into fields
	tokens = map(lambda fp: fp.split('\t'), frequentPatterns)

	# gives us only the canonical strings of the patterns and their id
	pairs = map(lambda t: (t[1], t[2]), tokens)

	# transform to json strings
	jsonCStrings = map(lambda pair: '{"patternid":' + pair[0] + ',"pattern":' + cString2json(pair[1]) + '}', pairs)

	# if your memory explodes, feel free to change this line and the output mode of this function
	jsonBlob = '[' + ',\n'.join(jsonCStrings) + ']'

	return jsonBlob


def parseCStringFileUpToSizePatterns(fIn, patternSize):
	'''Select the patterns with up to patternSize vertices from the open 
	file fIn. fIn is assumed to be in the format that lwg or lwgr 
	uses to store the frequent patterns.'''

	# here, we count the number of edges in the pattern
	patternSize = patternSize - 1 

	# gives us the patterns up to the selected size
	frequentPatterns = filter(lambda line: line.count('(') <= patternSize, fIn)

	# splits the strings into fields
	tokens = map(lambda fp: fp.split('\t'), frequentPatterns)

	# gives us only the canonical strings of the patterns and their id
	pairs = map(lambda t: (t[1], t[2]), tokens)

	# transform to json strings
	jsonCStrings = map(lambda pair: '{"patternid":' + pair[0] + ',"pattern":' + cString2json(pair[1]) + '}', pairs)

	# if your memory explodes, feel free to change this line and the output mode of this function
	jsonBlob = '[' + ',\n'.join(jsonCStrings) + ']'

	return jsonBlob


# if __name__ == '__main__':
# 	if len(sys.argv) != 3:
# 		sys.stderr.write('You need exactly two arguments: first leq or eq, second an integer.\n')
# 		sys.exit(1)
# 	else:
# 		try:
# 			knownFlag = False
# 			if sys.argv[1] == 'leq': 
# 				result = parseCStringFileUpToSizePatterns(sys.stdin, int(sys.argv[2]))
# 				knownFlag = True
# 			if sys.argv[1] == 'eq': 
# 				result = parseCStringFileFixedSizePatterns(sys.stdin, int(sys.argv[2]))
# 				knownFlag = True
			
# 			if not knownFlag:
# 				sys.stderr.write('First argument must be either leq or eq.\n')
# 				sys.exit(1)
			
# 			sys.stdout.write(result)
# 			sys.exit(0)

# 		except ValueError:
# 			sys.stderr.write('Second argument must be an integer.\n')
# 			sys.exit(1)

In [78]:
def recCheckEmbedding(pattern, transaction, mapping):
    # check if we are in a leaf vertex in both pattern and transaction
    if 'prediction' in pattern.keys() and 'prediction' in transaction.keys():
        mapping[pattern['id']] = transaction['id']
        return True
    
    # check if we are in a split vertex in both pattern and transaction
    if 'feature' in pattern.keys() and 'feature' in transaction.keys():

        # check if split features match
        if pattern['feature'] == transaction['feature']:
            
            foundLeft = True
            foundRight = True
            if 'leftChild' in pattern.keys():
                if 'leftChild' in transaction.keys():
                    foundLeft = recCheckEmbedding(pattern['leftChild'], transaction['leftChild'], mapping)    
                else:
                    foundLeft = False
                
            if 'rightChild' in pattern.keys():
                if 'rightChild' in transaction.keys():
                    foundRight = recCheckEmbedding(pattern['rightChild'], transaction['rightChild'], mapping)
                else:
                    foundRight = False
                
            if foundLeft and foundRight:
                mapping[pattern['id']] = transaction['id']
                return True
            else:
                return False
            
    # if we are in the mixed case split vertex vs. leaf vertex then we cannot map the vertices on each other
    return False


def checkForEmbedding(pattern, transaction):
    '''For two given root vertices, check whether pattern is a rooted 
    subtree of transaction such that the roots map to each other 
    and return a mapping id->id if so, o/w None'''
    
    mapping = dict()
    if recCheckEmbedding(pattern, transaction, mapping):
        return mapping
    else:
        return None
    
    
def findAllEmbeddings(pattern, patternid, transaction):
    '''Find all embeddings of pattern into transaction and store them in the transaction
    at the positions where the root vertex of the pattern maps to.
    This method expects to be called after initTransactionTreeForEmbeddingStorage()'''
    if 'feature' in transaction.keys():
        if 'leftChild' in transaction.keys():
            findAllEmbeddings(pattern, patternid, transaction['leftChild'])
        if 'rightChild' in transaction.keys():
            findAllEmbeddings(pattern, patternid, transaction['rightChild'])
    
    mapping = checkForEmbedding(pattern, transaction)
    if mapping is not None:
        transaction['patterns'].append((patternid, mapping))
        print(transaction['patterns'])

def initTransactionTreeForEmbeddingStorage(transaction):
    '''We want to be able to store all matching patterns and their embeddings at the
    vertices where the root of the pattern maps to. Hence, we init some fields 
    in the transaction decision tree.'''

    if 'feature' in transaction.keys():
        if 'leftChild' in transaction.keys():
            initTransactionTreeForEmbeddingStorage(transaction['leftChild'])
        if 'rightChild' in transaction.keys():
            initTransactionTreeForEmbeddingStorage(transaction['rightChild'])
    
    transaction['patterns'] = list()


def loadAndProcess(patternInput, transactionInput, transactionOutput):
    transactions = json.load(transactionInput)
    patterns = json.load(patternInput)
    
    for transaction in transactions:
        initTransactionTreeForEmbeddingStorage(transaction)
        
    for transaction in transactions:
        for pattern in patterns:
            findAllEmbeddings(pattern['pattern'], pattern['patternid'], transaction)
            

            
    json.dump(transactions, transactionOutput)
    

In [25]:
patternIn = open('forests/rootedFrequentTrees/adult/WithLeafEdges/RF_10_t10.patterns', 'r')
patternJson = parseCStringFileFixedSizePatterns(patternIn, 4)
patternIn.close()

patternjsondump = open('testPatterns.json', 'w')
patternjsondump.write(patternJson)
patternjsondump.close()

In [77]:
transactionInFile = open('forests/adult/text/RF_10.json', 'r')
patternInFile = open('testPatterns.json', 'r')

transactionOutFile = open('testTransactions.json', 'w')

loadAndProcess(patternInFile, transactionInFile, transactionOutFile)

transactionInFile.close()
patternInFile.close()
transactionOutFile.close()

[(337, {2: 141, 3: 142, 1: 140, 0: 139})]
[(337, {2: 224, 3: 225, 1: 223, 0: 222})]
[(337, {2: 141, 3: 142, 1: 140, 0: 139}), (336, {2: 141, 1: 140, 3: 143, 0: 139})]
[(338, {2: 369, 3: 370, 1: 368, 0: 358})]
[(335, {2: 364, 3: 365, 1: 363, 0: 361})]
[(328, {2: 446, 3: 447, 1: 445, 0: 437})]
[(337, {2: 177, 3: 178, 1: 176, 0: 175})]
[(337, {2: 177, 3: 178, 1: 176, 0: 175}), (336, {2: 177, 1: 176, 3: 179, 0: 175})]
[(327, {1: 570, 3: 573, 2: 571, 0: 569})]
[(328, {2: 468, 3: 469, 1: 467, 0: 463})]
[(327, {1: 570, 3: 573, 2: 571, 0: 569}), (328, {2: 572, 3: 573, 1: 571, 0: 569})]
[(326, {2: 618, 3: 619, 1: 617, 0: 609})]
[(330, {2: 273, 3: 274, 1: 272, 0: 268})]
[(330, {2: 335, 3: 336, 1: 334, 0: 332})]
[(329, {2: 34, 3: 35, 1: 33, 0: 29})]
[(329, {2: 97, 3: 98, 1: 96, 0: 92})]
[(327, {1: 436, 3: 441, 2: 437, 0: 435})]
[(337, {2: 514, 3: 515, 1: 513, 0: 512})]
[(337, {2: 514, 3: 515, 1: 513, 0: 512}), (336, {2: 514, 1: 513, 3: 516, 0: 512})]
[(335, {2: 414, 3: 415, 1: 413, 0: 409})]
[(33

In [45]:
with open('testTransactions.json', 'r') as transactionOutFile:
    result = json.load(transactionOutFile)

print(json.dumps(result, indent=2))

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
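
Because the full dump exceeds the notebook's output rate limit, a compact summary may be more practical than printing everything. The sketch below counts, per transaction tree, how many embeddings of each pattern were stored in the `patterns` fields. `countStoredEmbeddings` is a hypothetical helper, and the toy `result` (with made-up mappings) only stands in for the content of testTransactions.json. Note that after the JSON round trip each stored `(patternid, mapping)` tuple comes back as a two-element list, and the integer keys of the mapping come back as strings.

```python
from collections import Counter


def countStoredEmbeddings(transaction, counts=None):
    '''Count the embeddings stored in the 'patterns' fields of one tree, per pattern id.'''
    if counts is None:
        counts = Counter()
    for patternid, mapping in transaction.get('patterns', []):
        counts[patternid] += 1
    for child in ('leftChild', 'rightChild'):
        if child in transaction:
            countStoredEmbeddings(transaction[child], counts)
    return counts


# toy stand-in for the loaded file: one small annotated tree with made-up mappings
result = [
    {'id': 0, 'feature': 63,
     'patterns': [[337, {'0': 0, '1': 1}]],
     'leftChild': {'id': 1, 'feature': 0,
                   'patterns': [[337, {'0': 1, '1': 2}], [336, {'0': 1, '1': 2}]],
                   'leftChild': {'id': 2, 'prediction': [], 'patterns': []}},
     'rightChild': {'id': 3, 'prediction': [], 'patterns': []}},
]

for treeIndex, tree in enumerate(result):
    print(treeIndex, dict(countStoredEmbeddings(tree)))
```

A pattern id with a count above one in a single tree indicates several embeddings of that pattern in the same decision tree, which is the quantity of interest when checking how often patterns repeat within a tree.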



## Conclusion
We now have the tools to transform the data and to compute all embeddings of frequent patterns explicitly. 
This will soon be put into a nice script that is usable from the command line.

At a first quick glance, it seems that the patterns that were identified as frequent mostly have only one embedding per random decision tree. 
But this needs to be investigated more closely.