# Deep Learning Toolkit for Splunk - Process Mining with PM4Py

This notebook contains a barebones example workflow that shows how to develop custom containerized code that runs in Splunk Enterprise and interfaces with the Deep Learning Toolkit for Splunk.

## Stage 0 - import libraries
Stage 0 defines all imports that the subsequent code needs.

In [2]:
import json
import numpy as np
import pandas as pd
import pm4py
from pm4py.objects.log.util import dataframe_utils
from pm4py.objects.conversion.log import converter as log_converter
from pm4py.algo.discovery.alpha import algorithm as alpha_miner
from pm4py.algo.discovery.inductive import algorithm as inductive_miner
from pm4py.algo.discovery.dfg import algorithm as dfg_discovery
from pm4py.visualization.petrinet import visualizer as pn_visualizer

# ...
# global constants
MODEL_DIRECTORY = "/srv/app/model/data/"

In [3]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
print("numpy version: " + np.__version__)
print("pandas version: " + pd.__version__)
print("pm4py version: " + pm4py.__version__)

numpy version: 1.18.1
pandas version: 1.1.5
pm4py version: 2.1.2


## Stage 1 - get a data sample from Splunk
In Splunk, run a search to pipe a dataset into your notebook environment. Note: mode=stage is passed to the | fit command to do this.

index=_internal uri=* user=* <br>
| stats count by _time uri user <br>
| eval start_timestamp=strftime(_time, "%Y%m%dT%H%M%S") <br>
| rename uri as case:concept:name user as concept:name <br>
| eval time:timestamp = start_timestamp<br>
| fit MLTKContainer algo=process_mining mode=stage case:concept:name,concept:name,start_timestamp,time:timestamp into process_mining<br>

After you run this search, your data sample is available as a CSV file inside the container for model development. The file name is taken from the model_name value, or set to "default" if no model_name is present. This step is intended for working with a subset of your data while you create your custom model.
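
The search above formats `_time` with `strftime(_time, "%Y%m%dT%H%M%S")`, so staged timestamps look like `20201221T105423`. As a minimal sketch of how such a value could be parsed on the Python side with the standard library only (the helper name `parse_splunk_timestamp` is hypothetical and not part of the toolkit):

```python
from datetime import datetime

# Format string matching the strftime call in the search above.
SPLUNK_TS_FORMAT = "%Y%m%dT%H%M%S"

def parse_splunk_timestamp(value):
    """Parse a compact timestamp such as '20201221T105423' (hypothetical helper)."""
    return datetime.strptime(value, SPLUNK_TS_FORMAT)

ts = parse_splunk_timestamp("20201221T105423")
print(ts.isoformat())
```

Parsing the string explicitly makes the expected format visible and fails loudly if a value does not match it.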

In [6]:
# this cell is not executed and should only be used for staging data into the notebook environment to have it accessible in this notebook
def stage(name):
    with open("data/"+name+".csv", 'r') as f:
        df = pd.read_csv(f)
    with open("data/"+name+".json", 'r') as f:
        param = json.load(f)
    return df, param

In [7]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
df, param = stage("process_mining")
print(param)

{'options': {'params': {'algo': 'process_mining', 'mode': 'stage'}, 'args': ['case:concept:name', 'concept:name', 'start_timestamp', 'time:timestamp'], 'feature_variables': ['case:concept:name', 'concept:name', 'start_timestamp', 'time:timestamp'], 'model_name': 'process_mining', 'algo_name': 'MLTKContainer', 'mlspl_limits': {'disabled': False, 'handle_new_cat': 'default', 'max_distinct_cat_values': '10000', 'max_distinct_cat_values_for_classifiers': '10000', 'max_distinct_cat_values_for_scoring': '10000', 'max_fit_time': '6000', 'max_inputs': '100000000', 'max_memory_usage_mb': '4000', 'max_model_size_mb': '150', 'max_score_time': '6000', 'streaming_apply': '0', 'use_sampling': '1'}, 'kfold_cv': None}, 'feature_variables': ['case:concept:name', 'concept:name', 'start_timestamp', 'time:timestamp']}
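
The parameter dictionary printed above carries the search options and the list of staged fields. The following sketch shows one way to inspect it and to check that the fields typically used for process mining (case identifier, activity, timestamp) were staged. The `param` literal is an abbreviated copy of the output above; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration.

```python
# Abbreviated copy of the parameter dictionary printed above.
param = {
    "options": {
        "params": {"algo": "process_mining", "mode": "stage"},
        "model_name": "process_mining",
    },
    "feature_variables": [
        "case:concept:name", "concept:name", "start_timestamp", "time:timestamp",
    ],
}

# Assumption: these three columns are needed downstream
# (case identifier, activity name, event timestamp).
REQUIRED_COLUMNS = ("case:concept:name", "concept:name", "time:timestamp")

def missing_columns(param, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent from the staged fields."""
    staged = set(param.get("feature_variables", []))
    return [name for name in required if name not in staged]

print(param["options"]["params"]["mode"])
print(missing_columns(param))
```

An empty list means every required field is present in the staged sample.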


In [8]:
df

Unnamed: 0,case:concept:name,concept:name,start_timestamp,time:timestamp
0,/services/server/info,-,20201221T105423,20201221T105423
1,/favicon.ico,-,20201221T105423,20201221T105423
2,/robots.txt,-,20201221T105423,20201221T105423
3,/servicesNS/nobody/splunk_instrumentation/tele...,splunk-system-user,20201221T105426,20201221T105426
4,/services/server/info,splunk-system-user,20201221T105426,20201221T105426
...,...,...,...,...
2204,/en-GB/splunkd/__raw/services/search/shelper?o...,admin,20201221T110413,20201221T110413
2205,/en-GB/splunkd/__raw/services/search/shelper?o...,admin,20201221T110413,20201221T110413
2206,/en-GB/splunkd/__raw/services/search/shelper?o...,admin,20201221T110413,20201221T110413
2207,/en-GB/splunkd/__raw/services/search/shelper?o...,admin,20201221T110413,20201221T110413


## Stage 2 - create and initialize a model

In [9]:
# initialize your model
# available inputs: data and parameters
# returns the model object which will be used as a reference to call fit, apply and summary subsequently
def init(df,param):
    model = {}
    return model

In [10]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
model = init(df,param)
print(model)

{}


## Stage 3 - fit the model

In [11]:
# train your model
# returns a fit info json object and may modify the model object
def fit(model,df,param):
    info = {}
    return info

In [12]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
print(fit(model,df,param))

{}


## Stage 4 - apply the model

In [15]:
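# helper: read a PNG file and return its content as a base64-encoded string
def pngfile_to_base64(filepath):
    import base64
    with open(filepath, 'rb') as file:
        encoded = base64.b64encode(file.read())
    # decode to plain text; str() on bytes would return the "b'...'" repr
    return encoded.decode('ascii')

# apply your model
# returns the calculated results
def apply(model,df,param):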
    # convert dataframe to pm4py compatible event_log object
    log_csv = dataframe_utils.convert_timestamp_columns_in_df(df)
    log_csv = log_csv.sort_values('start_timestamp')
    event_logs = log_converter.apply(log_csv, variant=log_converter.Variants.TO_EVENT_LOG)

    # apply dfg discovery
    dfg, start_activities, end_activities = pm4py.discover_dfg(event_logs)
    #dfg = dfg_discovery.apply(event_logs, variant=dfg_discovery.Variants.PERFORMANCE)
    temp_viz_file = 'dfg.png'
    pm4py.save_vis_dfg(dfg, start_activities, end_activities, temp_viz_file, log=None)
    model['plot_pairplot'] = pngfile_to_base64(temp_viz_file)
        
    # apply inductive miner for petri net retrieval
    net, initial_marking, final_marking = inductive_miner.apply(event_logs)
    temp_viz_file = 'petrinet.png'
    pm4py.save_vis_petri_net(net, initial_marking, final_marking, temp_viz_file)
    model['plot_matrix'] = pngfile_to_base64(temp_viz_file)

    # return a dot graphviz compatible description to feed the process diagram custom visualization on a splunk dashboard
    gviz = pn_visualizer.apply(net, initial_marking, final_marking, variant=pn_visualizer.Variants.FREQUENCY, log=log_csv)    
    return pd.DataFrame([str(gviz)], columns=['dot'])
    
    # Other options:
    # discover process tree
    #process_tree = pm4py.discover_tree_inductive(event_logs)
    #pm4py.save_vis_process_tree(process_tree, 'processtree.png')

    # add frequency information
    #dfg_frequency = dfg_discovery.apply(event_logs, variant=dfg_discovery.Variants.FREQUENCY)
    
    # get the raw arcs, e.g. of the petri net
    #return pd.DataFrame(list(net.arcs))
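
The cell above stores the rendered PNG files in the model as Base64 text. As a small standalone sketch of that encoding step, the example below encodes a few bytes and decodes them again. The helper name `bytes_to_base64_text` is hypothetical, and the PNG file signature is used only as sample binary content.

```python
import base64

def bytes_to_base64_text(data):
    """Encode raw bytes as a Base64 ASCII string (hypothetical helper)."""
    return base64.b64encode(data).decode("ascii")

# The eight-byte PNG file signature serves as sample binary content.
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = bytes_to_base64_text(png_signature)
print(encoded)

# Decoding restores the original bytes.
decoded = base64.b64decode(encoded)
print(decoded == png_signature)
```

`base64.b64encode` returns a `bytes` object; decoding it as ASCII yields a plain string that can be embedded in JSON or a dashboard field.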


In [16]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
result = apply(model,df,param)
result

Unnamed: 0,dot
0,"digraph ""imdf_net_1608545197.571939"" {\n\tgrap..."
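
To illustrate what a directly-follows graph (DFG) represents, here is a conceptual sketch in plain Python. It is not PM4Py's implementation: it simply groups events by case, orders them by timestamp, and counts how often one activity is immediately followed by another. The function name `directly_follows` and the sample events are made up for this example.

```python
from collections import Counter, defaultdict

def directly_follows(events):
    """Count directly-follows pairs per case.

    `events` is an iterable of (case_id, activity, timestamp) tuples.
    Returns (dfg, start_activities, end_activities) as Counters.
    """
    traces = defaultdict(list)
    for case_id, activity, timestamp in events:
        traces[case_id].append((timestamp, activity))

    dfg, starts, ends = Counter(), Counter(), Counter()
    for trace in traces.values():
        # Order by timestamp only; sorted() is stable, so ties keep input order.
        activities = [a for _, a in sorted(trace, key=lambda item: item[0])]
        starts[activities[0]] += 1
        ends[activities[-1]] += 1
        for pair in zip(activities, activities[1:]):
            dfg[pair] += 1
    return dfg, starts, ends

# Made-up sample events: two cases with three events each.
events = [
    ("case-1", "login", 1), ("case-1", "search", 2), ("case-1", "logout", 3),
    ("case-2", "login", 1), ("case-2", "search", 2), ("case-2", "search", 3),
]
dfg, starts, ends = directly_follows(events)
print(dict(dfg))
print(dict(starts), dict(ends))
```

Each key of `dfg` is an edge of the graph and each value is its frequency; the start and end counters record which activities open and close a case.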


## Stage 5 - save the model

In [46]:
# save model to name in expected convention "algo_<model_name>"
def save(model,name):
    #with open(MODEL_DIRECTORY + name + ".json", 'w') as file:
    #    json.dump(model, file)
    return model

In [None]:
#saved_model = save(model,'algo_barebone_model')
#saved_model

## Stage 6 - load the model

In [None]:
# load model from name in expected convention "algo_<model_name>"
def load(name):
    model = {}
    #with open(MODEL_DIRECTORY + name + ".json", 'r') as file:
    #    model = json.load(file)
    return model

In [None]:
#loaded_model = load('algo_barebone_model')
#loaded_model
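
The commented-out lines in `save` and `load` hint at JSON persistence. The following sketch shows such a round trip under two assumptions: the model is a JSON-serializable dictionary, and a temporary directory stands in for `MODEL_DIRECTORY`. The function names `save_json` and `load_json` are hypothetical.

```python
import json
import os
import tempfile

def save_json(model, name, directory):
    """Write the model dictionary to <directory>/<name>.json."""
    path = os.path.join(directory, name + ".json")
    with open(path, "w") as file:
        json.dump(model, file)
    return path

def load_json(name, directory):
    """Read the model dictionary back from <directory>/<name>.json."""
    with open(os.path.join(directory, name + ".json"), "r") as file:
        return json.load(file)

# A temporary directory stands in for MODEL_DIRECTORY in this sketch.
with tempfile.TemporaryDirectory() as directory:
    original = {"plot_matrix": "aGVsbG8=", "events": 3}
    save_json(original, "algo_barebone_model", directory)
    restored = load_json("algo_barebone_model", directory)

print(restored == original)
```

Because the plots are stored as Base64 strings, a model dictionary of this shape can be serialized with `json` directly.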

## Stage 7 - provide a summary of the model

In [8]:
# return a model summary
def summary(model=None):
    returns = {"version": {"pm4py": pm4py.__version__} }
    return returns

In [38]:
summary(model)

{'version': {'pm4py': '2.0.1.3'}}

## End of Stages
All subsequent cells are untagged and can be used for further freeform code.