# The α algorithm

A [Petri net](https://en.wikipedia.org/wiki/Petri_net) is a graph-based process model with nodes of two types: places and transitions. Places hold tokens: each place may contain zero, one, or more tokens. Transitions are activities. A transition is enabled if each of its directly preceding (input) places holds at least one token. An enabled transition fires by consuming exactly one token from each input place and producing exactly one token in each output place.
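The firing rule can be sketched in a few lines of plain Python. This is only an illustration, not PM4Py's implementation: the marking is a `Counter` mapping place names to token counts, and the toy net and the helper names `is_enabled` and `fire` are hypothetical.

```python
from collections import Counter

def is_enabled(marking, inputs):
    """A transition is enabled if every input place holds at least one token."""
    return all(marking[p] >= 1 for p in inputs)

def fire(marking, inputs, outputs):
    """Consume one token per input place and produce one per output place."""
    if not is_enabled(marking, inputs):
        raise ValueError("transition is not enabled")
    new_marking = Counter(marking)
    for p in inputs:
        new_marking[p] -= 1
    for p in outputs:
        new_marking[p] += 1
    return +new_marking  # unary plus drops places left with zero tokens

# Hypothetical toy net: start -> [a] -> p1 -> [b] -> end
transitions = {"a": (["start"], ["p1"]), "b": (["p1"], ["end"])}
marking = Counter({"start": 1})
marking = fire(marking, *transitions["a"])
marking = fire(marking, *transitions["b"])
print(dict(marking))
```

After both transitions fire, the only token left is in the `end` place, so neither transition is enabled any more.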

A workflow net (WF-net) is a Petri net with designated start and end places that forms a single [connected component](https://en.wikipedia.org/wiki/Component_(graph_theory)). WF-nets are used to model business processes. In a WF-net, a business case begins when a token is put into the start place. In a sound WF-net (see Lecture 2), the process ends when the token appears in the end place. In an unsound WF-net, a token may appear in the end place without ending the process.

The α algorithm is a very simple algorithm for discovering WF-nets from event logs. See Lecture 6 for the details on the α algorithm.
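As a rough illustration of how the α algorithm starts, the sketch below derives the directly-follows pairs and the footprint relations (causality, parallelism, choice) from a toy log. It covers only this first step, as the algorithm is commonly described, and not the construction of places. The function name `footprint` and the example traces are made up for illustration.

```python
from itertools import product

def footprint(traces):
    """Return the directly-follows pairs and the derived footprint relations."""
    follows = {(a, b) for t in traces for a, b in zip(t, t[1:])}
    activities = sorted({a for t in traces for a in t})
    relations = {}
    for a, b in product(activities, repeat=2):
        if (a, b) in follows and (b, a) not in follows:
            relations[a, b] = "->"   # causality: a is followed by b, never the reverse
        elif (b, a) in follows and (a, b) not in follows:
            relations[a, b] = "<-"   # reversed causality
        elif (a, b) in follows and (b, a) in follows:
            relations[a, b] = "||"   # parallel: both orders were observed
        else:
            relations[a, b] = "#"    # choice: never directly follow each other
    return follows, relations

# Hypothetical toy log with three variants
traces = [("a", "b", "c", "d"), ("a", "c", "b", "d"), ("a", "e", "d")]
follows, relations = footprint(traces)
for pair in [("a", "b"), ("b", "c"), ("b", "e")]:
    print(pair, relations[pair])
```

Here `b` and `c` occur in both orders and are therefore treated as parallel, while `b` and `e` never follow each other directly and end up in a choice relation.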

## Preliminaries

Install the PM4Py package and download the event logs for the exercises.

In [None]:
# !pip install pm4py
# !pip install pyvis
# !wget http://www.cs.put.poznan.pl/tpawlak/files/EP/RoadTrafficFineManagement.xes.gz

# from google.colab import data_table
# data_table.enable_dataframe_formatter()

# from google.colab import files

In [None]:
from IPython.display import display, IFrame, Markdown
from pm4py.algo.filtering.log.start_activities import start_activities_filter
from pm4py.algo.filtering.log.end_activities import end_activities_filter
from pm4py.statistics.end_activities.log import get as end_activities_get

import base64

import pandas as pd
import pm4py

def view_html(filename: str):
  with open(filename, "rb") as f:  # close the file once it has been read
    encoded = base64.b64encode(f.read()).decode("ascii")
  display(IFrame(src="data:text/html;base64," + encoded, width=1000, height=1000))

def printmd(string):
  display(Markdown(string))

## Exercises

Import the `RoadTrafficFineManagement.xes.gz` event log.

In [None]:
log = pm4py.read_xes("Sepsis.xes.gz", variant="iterparse20")

# integer_features = ["article", "points"]
# for feature in integer_features:
#   log[feature] = log[feature].astype("Int32")

# nominal_features = ["org:resource", "article", "dismissal", "vehicleClass", "notificationType", "lastSent"]
# for feature in nominal_features:
#   log[feature] = log[feature].astype("string")

log[:10000]

Analyze the event log and answer the questions:
* How many traces and events are in the event log?
  150370 traces, 561470 events
* What are the activities in the event log?
  'Create Fine', 'Send Fine', 'Insert Fine Notification',
  'Add penalty', 'Send for Credit Collection', 'Payment',
  'Insert Date Appeal to Prefecture', 'Send Appeal to Prefecture',
  'Receive Result Appeal from Prefecture',
  'Notify Result Appeal to Offender', 'Appeal to Judge'
* What is the time window for the event log?
  Timestamp('2000-01-01 00:00:00+0000', tz='UTC') - Timestamp('2013-06-18 00:00:00+0000', tz='UTC')
* Which activities start and end the process?
  start: Create Fine, end: Send Fine, Payment
* What resources are involved in individual activities?
  561	Create Fine
* What is the distribution of fine amount?
  right-skewed
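The questions above boil down to a few aggregations over the events grouped by case. The sketch below shows the idea on a tiny, made-up log stored as plain tuples, so it runs without PM4Py or pandas. The case identifiers and timestamps are hypothetical and only the activity names are borrowed from the log.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical toy log: (case id, activity, timestamp)
events = [
    ("c1", "Create Fine", datetime(2000, 1, 1)),
    ("c1", "Payment", datetime(2000, 1, 20)),
    ("c2", "Create Fine", datetime(2000, 2, 1)),
    ("c2", "Send Fine", datetime(2000, 2, 10)),
    ("c3", "Create Fine", datetime(2000, 3, 5)),
    ("c3", "Send Fine", datetime(2000, 3, 9)),
    ("c3", "Payment", datetime(2000, 4, 1)),
]

# Group events into traces, ordered by timestamp within each case
traces = defaultdict(list)
for case, activity, ts in sorted(events, key=lambda e: (e[0], e[2])):
    traces[case].append(activity)

n_traces = len(traces)
n_events = len(events)
beginning = min(ts for _, _, ts in events)
end = max(ts for _, _, ts in events)

# First and last activity of every trace
start_activities = Counter(t[0] for t in traces.values())
end_activities = Counter(t[-1] for t in traces.values())

print(f"# traces: {n_traces}, # events: {n_events}")
print(f"Time window: {beginning} - {end}")
print(dict(start_activities), dict(end_activities))
```

On the real log the same quantities come from grouping the DataFrame by `case:concept:name`.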

In [None]:
# n_traces = # TODO: calculate number of traces
# print(f"# traces: {n_traces}")

# n_events = # TODO: calculate number of events
# print(f"# events: {n_events} ({float(n_events)/n_traces} per trace)")

# timestamp_attribute = "" # TODO: set the name of the attribute holding time information
# beginning = log[timestamp_attribute].min()
# end = log[timestamp_attribute].max()
# print(f"Time window: {beginning} - {end} ({end - beginning})")

activity_attribute = "concept:name" # TODO: set the name of the attribute holding the activity name
activities = log.groupby(activity_attribute)[activity_attribute].count()
printmd("**Activities:**")
display(activities)

# start_activities = pm4py... # TODO: find start activities
# end_activities = pm4py... # TODO: find end activities
# printmd("**Start activities:**")
# print(", ".join(start_activities.keys()))
# printmd("**End activities:**")
# print(", ".join(end_activities.keys()))

In [None]:
log_df = log.sort_values(by=["case:concept:name", "time:timestamp"])

# Get first event per case
first_events = log_df.groupby("case:concept:name").first().reset_index()

# Get last event per case
last_events = log_df.groupby("case:concept:name").last().reset_index()

# Filter cases where first activity is "ER Registration"
cases_start = first_events[first_events["concept:name"] == "ER Registration"]["case:concept:name"]

# Filter cases whose last activity is one of the accepted end activities (adjust the list as needed)
cases_end = last_events[last_events["concept:name"].isin(["Release A", "Release B", "Release C", "Release D", "Release E", "Return ER"])]["case:concept:name"]

# Keep only cases where both conditions hold
cases_to_keep = set(cases_start).intersection(set(cases_end))

# Filter the original dataframe to only keep those cases
filtered_log_df = log_df[log_df["case:concept:name"].isin(cases_to_keep)].copy()
len(filtered_log_df)

What do the most common process variants look like?
**Draw on a piece of paper the general structure of the process that reproduces at least 90% of the most common behavior.**

Do people usually pay fines or not?
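To decide how many variants are needed to reach a given share of the behavior, sort the variants by frequency and accumulate their counts. A minimal sketch on made-up variant counts follows; the helper name `smallest_k` is hypothetical.

```python
from collections import Counter

def smallest_k(variant_counts, target=0.9):
    """Return the smallest number of top variants covering `target` of all traces."""
    total = sum(variant_counts.values())
    covered = 0
    for k, (_, count) in enumerate(variant_counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return k
    return len(variant_counts)

# Hypothetical variant frequencies (variant -> number of traces)
variants_toy = Counter({
    ("a", "b"): 60,
    ("a", "c"): 25,
    ("a", "b", "c"): 10,
    ("a",): 5,
})
print(smallest_k(variants_toy))
```

With these counts the two most frequent variants cover 85% of the traces, so three variants are needed to pass the 90% threshold.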

In [None]:
variants = pm4py.get_variants(filtered_log_df)
len(variants)

In [None]:
common_variants = pd.DataFrame(variants.items(), columns=["variant", "count"]).sort_values("count", ascending=False)
common_variants["percentile"] = common_variants["count"] * 100 / common_variants["count"].sum()
common_variants

In [None]:
pd.set_option('display.max_colwidth', None)
print(common_variants['variant'].head(30))
pd.set_option('display.max_colwidth', 30)

In [None]:
k = 34
common_variants[0:k]['percentile'].sum()

In [None]:
display(common_variants[0:k])

log_top_k = pm4py.filter_variants_top_k(filtered_log_df, k) # TODO: filter top k variants

wfnet_inductive = pm4py.discover_petri_net_inductive(log_top_k) # TODO: discover WF-net using the Inductive Miner
pm4py.vis.view_petri_net(*wfnet_inductive)

from pm4py.algo.analysis.woflan import algorithm as woflan

woflan_parameters = {
    woflan.Parameters.RETURN_ASAP_WHEN_NOT_SOUND: False,
    woflan.Parameters.PRINT_DIAGNOSTICS: True,
    woflan.Parameters.RETURN_DIAGNOSTICS: False
}

is_sound = woflan.apply(*wfnet_inductive, parameters=woflan_parameters) # TODO: check soundness
print(f"Is sound: {is_sound}")

Pick the model that you feel is the best and replay the log on it to add frequencies of events/activities and temporal information.

In [None]:
from pm4py.visualization.petri_net import visualizer as pn_visualizer

net = wfnet_inductive # TODO: set WF-net that you feel the best

# add frequencies
parameters_freq = {pn_visualizer.Variants.FREQUENCY.value.Parameters.FORMAT: "png"}
gviz_freq = pn_visualizer.apply(*net, parameters=parameters_freq, variant=pn_visualizer.Variants.FREQUENCY, log=log)
pn_visualizer.view(gviz_freq)

# add temporal information
parameters_temp = {pn_visualizer.Variants.PERFORMANCE.value.Parameters.FORMAT: "png"}
gviz_temp = pn_visualizer.apply(*net, parameters=parameters_temp, variant=pn_visualizer.Variants.PERFORMANCE, log=log)
pn_visualizer.view(gviz_temp)

Calculate fitness, precision, generalization, and simplicity for the best WF-net.
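Of the four measures, simplicity is the only one that looks at the model alone. One common choice is an inverse arc-degree measure: the more arcs per node on average, the lower the score. The sketch below follows that idea. The formula `1 / (1 + max(mean_degree - k, 0))` and the constant `k = 2` are assumptions about what PM4Py's default evaluator computes, so check the library source before relying on them. The function name and the toy nets are hypothetical.

```python
from collections import Counter
from statistics import mean

def arc_degree_simplicity(arcs, k=2):
    """Inverse arc-degree simplicity of a net given as (source, target) arcs.

    Nodes without any arc are ignored in this sketch.
    """
    degree = Counter()
    for source, target in arcs:
        degree[source] += 1  # outgoing arc of the source node
        degree[target] += 1  # incoming arc of the target node
    mean_degree = mean(degree.values()) if degree else 0.0
    return 1.0 / (1.0 + max(mean_degree - k, 0))

# Hypothetical sequential net: start -> a -> p1 -> b -> end
sequential = [("start", "a"), ("a", "p1"), ("p1", "b"), ("b", "end")]

# Hypothetical net where a and b are connected through three places
dense = [("start", "a"), ("b", "end")]
for place in ("p1", "p2", "p3"):
    dense += [("a", place), (place, "b")]

print(arc_degree_simplicity(sequential))
print(round(arc_degree_simplicity(dense), 3))
```

The sequential net has a mean degree below `k` and scores 1.0, while the denser net is penalized.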

In [None]:
from pm4py.algo.evaluation.generalization import algorithm as generalization_evaluator
from pm4py.algo.evaluation.simplicity import algorithm as simplicity_evaluator

net = wfnet_inductive

fitness = pm4py.fitness_alignments(log, *net) # TODO: calculate fitness using alignments
precision = pm4py.precision_alignments(log, *net) # TODO: calculate precision using alignments
generalization = generalization_evaluator.token_based.apply(log, *net) # TODO: calculate generalization using token-based replay
simplicity = simplicity_evaluator.apply(net[0]) # TODO: calculate simplicity; what kind of measure is it?

In [None]:
print(f"Fitness: {fitness}")
print(f"Precision: {precision:.3f}")  # 22% of cases are not in the log
print(f"Generalization: {generalization:.3f}")
print(f"Simplicity: {simplicity:.3f}")

Replay each trace on this model using the alignment algorithm and visualize the result of conformance check.
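An alignment pairs the events of a trace with the steps of a model run, using a skip symbol wherever only one side moves. The sketch below aligns a trace with a single, fixed model run using dynamic programming and unit costs for log-only and model-only moves. It is a simplification: a real alignment algorithm searches over all runs allowed by the model. The function name `align` and the example sequences are hypothetical.

```python
SKIP = ">>"  # marks the side that does not move

def align(trace, model_run):
    """Cheapest alignment of a trace with one model run (unit cost per non-synchronous move)."""
    n, m = len(trace), len(model_run)
    # cost[i][j]: minimal cost of aligning trace[i:] with model_run[j:]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            options = []
            if i < n and j < m and trace[i] == model_run[j]:
                options.append(cost[i + 1][j + 1])   # synchronous move, cost 0
            if i < n:
                options.append(1 + cost[i + 1][j])   # move on log only
            if j < m:
                options.append(1 + cost[i][j + 1])   # move on model only
            cost[i][j] = min(options)
    # Walk the table from the start to recover one optimal alignment
    moves, i, j = [], 0, 0
    while i < n or j < m:
        if i < n and j < m and trace[i] == model_run[j] and cost[i][j] == cost[i + 1][j + 1]:
            moves.append((trace[i], model_run[j]))
            i, j = i + 1, j + 1
        elif i < n and cost[i][j] == 1 + cost[i + 1][j]:
            moves.append((trace[i], SKIP))
            i += 1
        else:
            moves.append((SKIP, model_run[j]))
            j += 1
    return cost[0][0], moves

total_cost, moves = align(["a", "c", "d"], ["a", "b", "c", "d"])
print(total_cost, moves)
```

The trace skips activity `b`, so the cheapest alignment contains one model-only move and has cost 1.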

In [None]:
alignments = pm4py.conformance_diagnostics_alignments(log, *net) # TODO: calculate alignments of the log and model
pm4py.vis.view_alignments(log, alignments)

Explain decisions in the best WF-net using decision trees induced by replaying the event log on the WF-net.
For performance reasons sample at most 2000 traces from the entire event log.

Interpret the resulting decision trees in the context of the decision points.

In [None]:
from pm4py.algo.decision_mining import algorithm as decision_mining
from pm4py.visualization.petri_net import visualizer
from pm4py.visualization.decisiontree import visualizer as tree_visualizer
from sklearn import tree

net = wfnet_inductive # TODO: set the WF-net that you feel the best

# View the WF-net
pm4py.view_petri_net(*net)
# View the identifiers of the nodes in the WF-net
gviz = visualizer.apply(*net, parameters={visualizer.Variants.WO_DECORATION.value.Parameters.DEBUG: True})
visualizer.view(gviz)

# Sample the event log for performance reasons
log_sample = pm4py.sample_cases(log, 2000) # TODO: sample 2000 traces from the event log

decision_points = decision_mining.get_decision_points(net[0]) # TODO: get decision points
for point in sorted(decision_points.keys()):
  X, y, classes = decision_mining.apply(log_sample, *net, decision_point=point)
  X = X.fillna(0)
  X = X[~y.isna()]
  y = y[~y.isna()]
  classes = [str(c) for c in classes]

  print(f"Decision point: {point}; possible decisions: {classes}")

  dt = tree.DecisionTreeClassifier(max_depth=10, ccp_alpha=0.005)
  dt = dt.fit(X, y)
  feature_names = list(X.columns.values.tolist())

  gviz = tree_visualizer.apply(dt, feature_names, classes)
  tree_visualizer.view(gviz)