# Parsing the Sagger project data
Make sure the **SaggerFlow.articyu3d** file is in the `.\data\` subdirectory of the current working directory.

In [1]:
import os
path = os.path.join('data','SaggerFlow.articyu3d')
if os.path.isfile(path):
    print('Success!')    
else:
    print('It is not, but feel free to look for it with this tiny widget.')
    import ipywidgets as widgets
    from IPython.display import display
    from ipywidgets import Dropdown, Output, HBox, Button
    gui = Output()
    gui.layout.border='1px solid black'
    
    filepath=''

    root=os.getcwd()
    dirs=[root]

    out = Output()
    t=widgets.Text(value=filepath,disabled=True)
    b=Button(description='enter')
    w = Dropdown(options=os.listdir(os.path.join(*dirs)))

    def onclick(b):
        if(b.description == 'enter'):
            try:
                dirs.append(w.value)
                w.options=os.listdir(os.path.join(*dirs))
            except OSError:
                b.button_style='danger'
                t.value = 'something went wrong! please reload the applet.'
                b.description = r':('
                b.disabled = True
                with out:
                    out.clear_output()
                    display(HBox([t, b]))

        elif(b.description == 'select'):
            global path  # rebind the notebook-level path that the later cells read
            path = os.path.join(*dirs + [w.value])
            print(path)
            gui.clear_output()
            gui.outputs = tuple()
            print('Success!')    

    b.on_click(onclick)
    def on_change(change):
        if change['type'] == 'change' and change['name'] == 'value':
            b.description='enter' if (os.path.isdir(os.path.join(*dirs, change['new']))) else 'select'
            b.button_style='info'
    w.observe(on_change)

    with gui:
        display(HBox([w,b]))
    display(gui)
    


Success!


Since that file is pretty big, we're going to parse it line by line with the help of iterators, so that it never has to fit into memory all at once. In other words, `lines` is a **generator** object: each `next(lines)` call yields a new line that can be fed to some pipeline and processed:

In [2]:
lines = (line for line in open(path, encoding='utf-8'))
for _ in range(10):
    line = next(lines)
    print(line)


# counting every line means reading the whole file, so this will take a moment:
lines = (line for line in open(path, encoding='utf-8'))
max_line = sum(1 for _ in lines)

print(max_line)


{

  "Settings": {

    "set_TextFormatter": "Plain",

    "set_IncludedNodes": "Settings, Project, GlobalVariables, ObjectDefinitions, Packages, ScriptMethods, Hierarchy, Assets",

    "set_Localization": "False",

    "set_UseScriptSupport": "True",

    "ExportVersion": "1.2",

    "ObjectDefinitionsHash": "69156DE3B5146583EACDB5A1427B03AC8244E99B55B0A26CD7B00D39264DA792",

    "ScriptFragmentsHash": "3215635178"

  },

29346773
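
The peek-and-count steps above leave the file handles open until the generators are garbage-collected. A minimal sketch of the same two steps with a context manager follows. The helper names `peek` and `count_lines` and the tiny stand-in file are made up for illustration; only `open`, `itertools.islice` and `tempfile` come from the standard library.

```python
import itertools
import os
import tempfile

def peek(file_path, n=10):
    """Return the first n lines without reading the rest of the file."""
    with open(file_path, encoding='utf-8') as handle:
        return list(itertools.islice(handle, n))

def count_lines(file_path):
    """Count lines one at a time, so memory use stays flat."""
    with open(file_path, encoding='utf-8') as handle:
        return sum(1 for _ in handle)

# Tiny stand-in file (hypothetical content) so the sketch runs on its own.
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('{\n  "Settings": {\n    "ExportVersion": "1.2"\n  }\n}\n')
    sample_path = tmp.name

first_two = peek(sample_path, 2)
total = count_lines(sample_path)
os.remove(sample_path)

print(first_two, total)
```

A file object is already a line iterator, so wrapping it in a generator expression is optional; the `with` block only adds a guaranteed close.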


In [None]:
lines = (line for line in open(path, encoding='utf-8'))
def simplest_line_parser(l):
    print(l)
def simple_line_parser(l):
    print(l.strip())

def process_many(f, batchsize):
    for _ in range(batchsize):
        f(next(lines))    
    print('')
        
process_many(simplest_line_parser,2)
process_many(simple_line_parser,8)

# execute this cell to get deeper into the file in order to understand its structure

## File structure
The file structure resembles JSON, XML, or simply nested dictionaries. In other words, it's some kind of tree: there's clearly one root node with child nodes of a similar nature. It's not infinite though, so eventually we should encounter a leaf, that is, a node without any child nodes.

Makes sense, right? But what does a _node_ actually mean? I think we get the idea, so let's try to define it. To do that, we're going to write the first **grammar rule**. For that to be a *proper definition*, any expression following the rule should be a node, and vice versa.
Also, it's going to be recursive:

#### Node definition
  $$\textbf{N}\leftrightarrow  \langle\textbf{T} , \textbf{T}\rangle \lor \langle\textbf{T},[\textbf{N},\ldots]\rangle$$
  
Any expression **N** is a *node* if and only if it matches one of the two following patterns:
1. It's an ordered pair of *terms*, which can be read as a key-value mapping between two strings. Since such a node doesn't have any children, it's a leaf.
2. It's a *term* paired with a list of *nodes*, each of which is again either a leaf or a node of this second kind.
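
The two patterns can be written down almost literally in Python. This is only a sketch under the assumption that a node is represented as a 2-tuple and a term as a plain string; the names `Node`, `is_leaf` and `is_node` are made up for illustration and are not part of the parser in this notebook.

```python
from typing import List, Tuple, Union

# A node is a pair: (term, term) for a leaf, or (term, [nodes]) for a branch.
Node = Tuple[str, Union[str, List['Node']]]

def is_leaf(node):
    """Pattern 1: the second element is a plain term, not a list of nodes."""
    _, payload = node
    return isinstance(payload, str)

def is_node(candidate):
    """Check the recursive definition against an arbitrary Python value."""
    if not (isinstance(candidate, tuple) and len(candidate) == 2):
        return False
    term, payload = candidate
    if not isinstance(term, str):
        return False
    if isinstance(payload, str):
        return True          # pattern 1: pair of terms
    # pattern 2: a term paired with a list whose items are all nodes
    return isinstance(payload, list) and all(is_node(c) for c in payload)

# Sample shaped like the "Settings" block at the top of the file.
settings = ('Settings', [('set_TextFormatter', 'Plain'),
                         ('ExportVersion', '1.2')])

print(is_node(settings), is_leaf(settings), is_leaf(settings[1][0]))
```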

##### Why bother? Wasn't that a waste of time?
Not at all. Seeing that only two cases need to be considered when parsing an expression gives us an idea of how to structure the pipeline. Also, the second case clearly suggests that measuring the depth while traversing such nested structures might be in order: depth is a clear indicator of the child-parent relation between visited nodes. That's handy.
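
To illustrate why depth is enough to recover the child-parent relation, here is a small sketch that keeps a stack of the keys currently "open" at each level. It assumes the entries arrive in file order as `(depth, key)` pairs, start at depth 0, and never jump down by more than one level at a time; `attach_parents` is a hypothetical helper, not something the notebook defines.

```python
def attach_parents(entries):
    """Pair each key with its parent's key, given (depth, key) in file order."""
    stack = []     # stack[d] holds the most recent key seen at depth d
    result = []
    for depth, key in entries:
        del stack[depth:]                      # close every level at or below this depth
        parent = stack[-1] if stack else None  # whatever is still open is the parent
        result.append((key, parent))
        stack.append(key)
    return result

# Depths and keys modelled on the "Project" and "GlobalVariables" blocks.
entries = [(0, 'Project'), (1, 'Name'), (1, 'Guid'),
           (0, 'GlobalVariables'), (1, 'Namespace')]

for key, parent in attach_parents(entries):
    print(key, '<-', parent)
```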

## Algorithm overview
The input is a `line` (a string), which we assume to be a *node*. First we determine which kind of node it is. If it's a leaf, simply return a `{key: value}` dictionary built from it. If it's not, then it holds a list of nodes that we can descend into: feed each of them back into the same algorithm as input, leaving the hard work to the recursion, and return the collected result once it comes back.
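
A rough sketch of that recursion, with one simplifying assumption: the line has already been tokenised, so a node arrives as a Python tuple, either `(key, value)` for a leaf or `(key, [child nodes])` otherwise. The function name `to_dict` is made up for illustration, and sibling nodes that share a key would overwrite each other in this version.

```python
def to_dict(node):
    """Turn a (key, value) or (key, [nodes]) tuple into nested dictionaries."""
    key, payload = node
    if isinstance(payload, str):
        # Leaf: nothing to descend into, return the key-value pair as is.
        return {key: payload}
    # Not a leaf: recurse into every child and merge what comes back.
    merged = {}
    for child in payload:
        merged.update(to_dict(child))
    return {key: merged}

# Values taken from the "Project" block of the export.
project = ('Project', [('Name', 'GolanWar'),
                       ('TechnicalName', 'Saggerflow')])

print(to_dict(project))
```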
  
#### Implementation
Let's start by extracting all possible key-value pairs together with their depth:

In [15]:
class ParseError(Exception):
    def __init__(self, pos, msg, *args):
        self.pos = pos
        self.msg = msg
        self.args = args

    def __str__(self):
        return '%s at position %s' % (self.msg % self.args, self.pos)

class FileParser:
    _lock = False
    
    def __init__(self, path=None, autoFill = True):
        self.cache = {}
        self.path=path
        self.autoFill = autoFill
           
    @property
    def path(self):
        return self._path
    
    @path.setter
    def path(self, val):
        if not self._lock and val is not None:
            self._path = val
            lines=(line for line in open(val, encoding='utf-8'))
            self.feed=lines
        
            
    @property
    def feed(self):
        self.current_line += 1

        return self._feed
    
    @feed.setter
    def feed(self, val):
        self._lock = True
        self._feed = val
        self.depth = -1
        self.current_line = 0
        self.allkeys = set()

    @feed.deleter
    def feed(self):
        self._lock = False
        if self.autoFill:
            path=self.path
            self.path=path
        else:
            del self._feed
            
    @property
    def line(self):
        self._line = next(self.feed)
        return self._line
    @line.setter
    def line(self, skip):
        # advance the feed; the last line read stays available as self._line
        for _ in range(1, skip):
            self._line = next(self.feed)
    def find_keys(self, which_keys):
        import csv
        from datetime import datetime
        


        lines_gathered=[]
        old_depth = -1

        for line in self.feed:
            depth  = self.getDepth(line)
            data = self.trimLine(line)
            key=data[0]

            if any(k in key for k in which_keys) and (old_depth>depth and old_depth != -1):
                old_depth = depth
                name=str(key)
                prefix=datetime.now().strftime("%Y%m%d_%H%M")
                filename=prefix+name+'.csv'
                with open(filename, 'a', encoding='utf-8', newline='') as f:
                    csv.writer(f, delimiter=',').writerows(lines_gathered)
                lines_gathered=[data]
            elif old_depth!=-1 and old_depth<=depth:
                lines_gathered=[data]
                
                
                
                    

    def process_data(self):
        import csv
        from datetime import datetime
        name='keyvalues_all'
        suffix=datetime.now().strftime("%Y%m%d_%H%M")
        filename=name+suffix+'.csv'
        with open(filename, 'a', encoding='utf-8') as f:
            f.write('key, value, depth\n') #writing the headers
            

            for line in self.feed:
                depth  = self.getDepth(line)
                data = self.trimLine(line)
                key=data[0]

                if any(ext in data for ext in '{ }'.split()):
                    continue

                try:
                    value=data[1]
                except IndexError:
                    # no ':' on this line, so reuse the only field
                    value=data[0]

                self.allkeys.add(key)
                f.write(', '.join([key, value, str(depth)]))
                f.write('\n')
    
    def getDepth(self, line):
        # the export indents by two spaces per nesting level
        return (len(line) - len(line.lstrip())) // 2
        
    def trimLine(self, line):
        import re
        data = [re.sub(r'[^{}A-Za-z0-9_\s]+', '', l) for l in [l.strip() for l in line.split(":")]]
        return data
    
    @property
    def linecount(self):
        try:
            return self._linecount
        except AttributeError:
            # first access: count once by draining the feed, then reset it
            self._linecount = sum(1 for _ in self.feed)
            del self.feed
            return self._linecount
        
p = FileParser(path) 
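
To see what the two helpers produce, here is a quick check on lines copied from the export. The functions are rewritten as standalone `get_depth` and `trim_line` (names chosen for this sketch) so the snippet runs without the data file; the logic mirrors `getDepth` and `trimLine` above.

```python
import re

def get_depth(line):
    """Nesting depth, assuming two spaces of indentation per level."""
    return (len(line) - len(line.lstrip())) // 2

def trim_line(line):
    """Split on ':' and keep only letters, digits, '_', braces and whitespace."""
    parts = [part.strip() for part in line.split(':')]
    return [re.sub(r'[^{}A-Za-z0-9_\s]+', '', part) for part in parts]

sample = '    "set_TextFormatter": "Plain",\n'
guid = '    "Guid": "08b09661-5911-4dce-b501-5bb3670351c8",\n'

print(get_depth(sample), trim_line(sample))
print(trim_line(guid))
```

Two side effects are worth keeping in mind when reading the resulting CSV: the regular expression strips all punctuation, so the hyphens in the GUID disappear, and any value that itself contains a `:` is split into more than two fields.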
    

Now we can extract all the available keys and then search for the interesting ones:

In [None]:
# p.linecount
p.process_data()

import re
keys=[]
queries=['Tutorial', 'Database', 'Entry']
for k in p.allkeys:
    if any(q in k for q in queries):
        keys.append(k)
        
print(keys)

In [16]:
keysToLookFor = ['CharacterDatabaseEntry', 'LinkedDatabaseEntries', 'DatabaseEntry', 'CommanderNotebookEntry', 'DatabasePopup', 'DatabaseEntryReference', 'Tutorial', 'TutorialEntry', 'LinkToDatabseEntry', 'DatabaseCategory', 'DatabaseSubcategory', 'EntryName', 'EntryKey']
p.find_keys(keysToLookFor)

Sadly, we got to the point where it doesn't seem to work yet. Basically, it's really tedious due to the size of the project, with all of its unnecessary files. :/


##### bunch of old code that got abandoned @ some point:

In [86]:
parsed_lines = ([int((len(s)-len(s.lstrip()))/2), *[l.strip() for l in s.split(":")]] for s in lines)

for _ in range(10):
    line=(next(parsed_lines))
    try:
        trailing_spaces, key, val = line[0], line[1], line[2]
    except:
        trailing_spaces, key, val = line[0], '', line[1]
    
    key=key.strip()
    val=val.strip()
    print('{0} [{1}::{2}] '.format('.'+trailing_spaces*'.', key, val))

.. ["Project"::{] 
... ["Name"::"GolanWar",] 
... ["DetailName"::"",] 
... ["Guid"::"08b09661-5911-4dce-b501-5bb3670351c8",] 
... ["TechnicalName"::"Saggerflow"] 
.. [::},] 
.. ["GlobalVariables"::[] 
... [::{] 
.... ["Namespace"::"OLD_Political",] 
.... ["Description"::"Political Echosystem Status",] 


In [44]:
# p.getAllKes()


In [36]:
d=set()
len(p.allKeys)

AttributeError: 'parser' object has no attribute 'allKeys'