# Event Data Filtering
*by: Sebastiaan J. van Zelst*

Like any data-driven field, the successful application of process mining needs *data munging and crunching*.
In pm4py, you can munge and crunch your data in two ways: you can write ```lambda``` functions and apply them to
your event log, or you can apply pre-built filtering and transformation functions.
In this tutorial, we briefly explain how to filter event data in both ways in pm4py.

## Generic Lambda Functions

In a nutshell, a lambda function allows you to specify a function that needs to be applied on a given element.
As a simple example, consider the following snippet:

In [59]:
f = lambda x: 2 * x
f(5)

10

In the code, we assign a ```lambda``` function to variable ```f```.
The function multiplies any input it receives by 2.
Hence ```f(1)=2```, ```f(2)=4```, etc.

Note that invoking ```f``` only works if we provide an argument that can be combined with the ```* 2``` operation.
For example, for ```strings```, the ```* 2``` operation concatenates the input argument with itself:

In [60]:
f('Pete')

'PetePete'
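
A lambda is just shorthand for a small named function, and its type flexibility has limits. The following minimal sketch illustrates both points; the name `double` is introduced here purely for illustration.

```python
# A lambda and an equivalent named function (the name `double` is illustrative).
f = lambda x: 2 * x


def double(x):
    return 2 * x


# Both behave identically on numbers, strings and lists.
same_on_int = f(5) == double(5) == 10
same_on_str = f('Pete') == double('Pete') == 'PetePete'
same_on_list = f([1, 2]) == double([1, 2]) == [1, 2, 1, 2]

# Types that do not support multiplication by an int raise a TypeError.
try:
    f(None)
    raised_type_error = False
except TypeError:
    raised_type_error = True
```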

## Filter and Map

Lambda functions allow us to write short, type-independent functions.
Given a list of objects, Python provides two core functions that can apply a given lambda function on each element of
the given list (in fact, any iterable):

- ```filter(f,l)```
  - apply the given lambda function ```f``` as a filter on the iterable ```l```.
- ```map(f,l)```
  - apply the given lambda function ```f``` as a transformation on the iterable ```l```.

For more information, study the concept of ‘higher order functions’ in Python, e.g., as introduced [here](https://www.codespeedy.com/higher-order-functions-in-python-map-filter-sorted-reduce/).
Let's consider a few simple examples.

In [61]:
l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
filter(lambda n: n >= 5, l)

<filter at 0x1fc59e9a4f0>

The previous example needs little explanation: the filter retains all numbers in the list that are greater than or equal to five.
What is interesting, however, is that the result is not a list but a lazily evaluated ```filter``` object.
Such an object can easily be transformed into a list by wrapping it in a ```list()``` call:

In [62]:
list(filter(lambda n: n >= 5, l))

[5, 6, 7, 8, 9, 10]

The same holds for the ```map()``` function:

In [63]:
map(lambda n: n * 3, l)

<map at 0x1fc59e7adf0>

In [64]:
list(map(lambda n: n * 3, l))

[3, 6, 9, 12, 15, 18, 21, 24, 27, 30]

Observe that the previous map function simply multiplies each element of list ```l``` by three.
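
Because ```filter``` and ```map``` return lazy iterators, they can be consumed only once. The sketch below illustrates this with the same list and also shows an equivalent list comprehension; the variable names are illustrative.

```python
l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# A filter object is an iterator: it yields its elements once and is then exhausted.
lazy = filter(lambda n: n >= 5, l)
first_pass = list(lazy)
second_pass = list(lazy)  # nothing left to yield

# filter and map can be chained; a list comprehension expresses the same thing.
chained = list(map(lambda n: n * 3, filter(lambda n: n >= 5, l)))
comprehension = [n * 3 for n in l if n >= 5]
```

If the result is needed more than once, it is therefore safer to materialize it with ```list()``` first.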

## Lambda-Based Filtering in pm4py

In pm4py, event log objects mimic lists of traces, which in turn, mimic lists of events.
Clearly, ```lambda``` functions can therefore be applied to event logs and traces.
However, as we have shown in the previous example, after applying such a lambda-based filter, the resulting object is no longer an event log.
Furthermore, casting a filter object or map object to an event log in ```pm4py``` is a bit more involved, i.e., it is
not so trivial as ```list(filter(...))``` in the previous example.
This is due to the fact that various meta-data is stored in the event log object as well.
To this end, pm4py offers wrapper functions that make sure that after applying your higher-order function with a lambda function,
the resulting object is again an Event Log object.
In the upcoming scripts, we'll take a look at some lambda-based filtering.
First, let's inspect the length of each trace in our running example log by applying a generic map function:

In [65]:
import pm4py

log = pm4py.read_xes('data/running_example.xes')
# inspect the length of each trace using a generic map function
list(map(lambda t: len(t), log))

parsing log, completed traces ::   0%|          | 0/6 [00:00<?, ?it/s]

[9, 5, 5, 5, 13, 5]

As we can see, there are four traces of length 5, one trace of length 9 and one trace of length 13.
Let's retain all traces that have a length greater than 5.

In [66]:
lf = pm4py.filter_log(lambda t: len(t) > 5, log)
list(map(lambda t: len(t), lf))

[9, 13]
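
To illustrate why a wrapper is needed at all, consider the following minimal sketch. It does not use pm4py: ```TinyLog``` and ```filter_tiny_log``` are hypothetical stand-ins, assuming only that a log is a list of traces that carries some extra metadata.

```python
class TinyLog(list):
    """Hypothetical stand-in for an event log: a list of traces plus metadata."""

    def __init__(self, traces=(), attributes=None):
        super().__init__(traces)
        self.attributes = dict(attributes or {})


def filter_tiny_log(f, log):
    """Apply f as a filter and rebuild a TinyLog that keeps the metadata."""
    return TinyLog(filter(f, log), attributes=log.attributes)


toy_log = TinyLog([['a', 'b', 'c'], ['a'], ['a', 'b']], attributes={'source': 'demo'})

# A plain list(filter(...)) keeps the traces but loses the metadata.
plain = list(filter(lambda t: len(t) > 1, toy_log))
plain_has_metadata = hasattr(plain, 'attributes')

# The wrapper returns the same traces and preserves the metadata.
wrapped = filter_tiny_log(lambda t: len(t) > 1, toy_log)
```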

The traces of length 9 and 13 have repeated behavior in them, i.e., the *reinitiate request* activity has been performed at least once:

In [67]:
list(map(lambda t: (len(t), len(list(filter(lambda e: e['concept:name'] == 'reinitiate request', t)))), log))

[(9, 1), (5, 0), (5, 0), (5, 0), (13, 2), (5, 0)]

Observe that the map function maps each trace onto a tuple.
The first element describes the length of the trace.
The second element describes the number of occurrences of the activity *reinitiate request*.
Observe that we obtain said counter by filtering the trace, i.e., by retaining only those events that describe the
*reinitiate request* activity and counting the length of the resulting list.
Note that a trace behaves like a list of events, and each event behaves like a dictionary.
In this case, the activity name is captured by the ```concept:name``` attribute.
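
The counting idiom used above can be written in a few equivalent ways. The sketch below uses a hand-made trace (a plain list of dictionaries, not a pm4py object) to show a generator-based count and a ```collections.Counter``` over all activities.

```python
from collections import Counter

# Hypothetical trace: a list of events, each event a dictionary.
trace = [
    {'concept:name': 'register request'},
    {'concept:name': 'check ticket'},
    {'concept:name': 'decide'},
    {'concept:name': 'reinitiate request'},
    {'concept:name': 'check ticket'},
    {'concept:name': 'decide'},
]

# Same idea as len(list(filter(...))), without building an intermediate list.
n_reinit = sum(1 for e in trace if e['concept:name'] == 'reinitiate request')

# Count every activity in a single pass over the trace.
activity_counts = Counter(e['concept:name'] for e in trace)
```

A ```Counter``` returns 0 for activities that never occur, which avoids a separate membership check.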

In general, pm4py supports the following *generic filtering functions*:

- ```pm4py.filter_log(f, log)```
  - filter the log according to a function ```f```.
- ```pm4py.filter_trace(f,trace)```
  - filter the trace according to function ```f```.
- ```pm4py.sort_log(log, key, reverse)```
  - sort the event log according to a given ```key```, reversed order if ```reverse==True```.
- ```pm4py.sort_trace(trace, key, reverse)```
  - sort the trace according to a given ```key```, reversed order if ```reverse==True```.

Let's see these functions in action:

In [68]:
print(len(log))
lf = pm4py.filter_log(lambda t: len(t) > 5, log)
print(len(lf))

6
2


In [69]:
print(len(log[0]))  # log[0] fetches the first trace
tf = pm4py.filter_trace(lambda e: e['concept:name'] in {'register request', 'pay compensation'}, log[0])
print(len(tf))

9
2


In [70]:
print(len(log[0]))
ls = pm4py.sort_log(log, lambda t: len(t))
print(len(ls[0]))
ls = pm4py.sort_log(log, lambda t: len(t), reverse=True)
print(len(ls[0]))

9
5
13


## Specific Filters

There are various pre-built filters in pm4py, which make commonly needed process mining filtering functionality a lot easier.
In the upcoming overview, we briefly present these functions.
We describe how to call them, their main input parameters and their return objects.
Note that all of the filters work on both DataFrames and pm4py event log objects.

### Start Activities
- ```filter_start_activities(log, activities, retain=True)```
  - retains (or drops) the traces that contain one of the given activities as the first event.

In [71]:
pm4py.filter_start_activities(log, {'register request'})

[{'attributes': {'concept:name': '3'}, 'events': [{'concept:name': 'register request', 'org:resource': 'Pete', 'time:timestamp': datetime.datetime(2010, 12, 30, 14, 32, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600))), 'Activity': 'register request', 'Resource': 'Pete', 'Costs': '50'}, '..', {'concept:name': 'pay compensation', 'org:resource': 'Ellen', 'time:timestamp': datetime.datetime(2011, 1, 15, 10, 45, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600))), 'Activity': 'pay compensation', 'Resource': 'Ellen', 'Costs': '200'}]}, '....', {'attributes': {'concept:name': '4'}, 'events': [{'concept:name': 'register request', 'org:resource': 'Pete', 'time:timestamp': datetime.datetime(2011, 1, 6, 15, 2, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600))), 'Activity': 'register request', 'Resource': 'Pete', 'Costs': '50'}, '..', {'concept:name': 'reject request', 'org:resource': 'Ellen', 'time:timestamp': datetime.datetime(2011, 1, 12, 15, 44, tzinfo=datetime.tim

In [72]:
pm4py.filter_start_activities(log, {'register request TYPO!'})

[]

In [73]:
import pandas

ldf = pm4py.format_dataframe(pandas.read_csv('data/running_example.csv', sep=';'), case_id='case_id',
                             activity_key='activity', timestamp_key='timestamp')
pm4py.filter_start_activities(ldf, {'register request'})

Unnamed: 0,case:concept:name,concept:name,time:timestamp,costs,org:resource,@@index
14,1,register request,2010-12-30 10:02:00+00:00,50,Pete,14
15,1,examine thoroughly,2010-12-31 09:06:00+00:00,400,Sue,15
16,1,check ticket,2011-01-05 14:12:00+00:00,100,Mike,16
17,1,decide,2011-01-06 10:18:00+00:00,200,Sara,17
18,1,reject request,2011-01-07 13:24:00+00:00,200,Pete,18
9,2,register request,2010-12-30 10:32:00+00:00,50,Mike,9
10,2,check ticket,2010-12-30 11:12:00+00:00,100,Mike,10
11,2,examine casually,2010-12-30 13:16:00+00:00,400,Sean,11
12,2,decide,2011-01-05 10:22:00+00:00,200,Sara,12
13,2,pay compensation,2011-01-08 11:05:00+00:00,200,Ellen,13


In [74]:
pm4py.filter_start_activities(ldf, {'register request TYPO!'})

Unnamed: 0,case:concept:name,concept:name,time:timestamp,costs,org:resource,@@index


### End Activities
- ```filter_end_activities(log, activities, retain=True)```
  - retains (or drops) the traces that contain the given activity as the final event.

For example, we can count the cases that end with a "payment of the compensation":

In [75]:
len(pm4py.filter_end_activities(log, {'pay compensation'}))

3
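
The semantics of the start and end activity filters, including the ```retain``` flag, can be sketched in plain Python as follows. The helper ```keep_by_boundary``` and the toy log are hypothetical and only mimic the documented behavior on lists of dictionaries; they are not pm4py's implementation.

```python
def ev(name):
    """Build a minimal event dictionary (illustrative helper)."""
    return {'concept:name': name}


def keep_by_boundary(log, activities, position, retain=True):
    """Hypothetical helper: position 0 checks the first event, -1 the last.

    Empty traces are dropped, since they have no first or last event.
    """
    return [
        t for t in log
        if t and (t[position]['concept:name'] in activities) == retain
    ]


toy_log = [
    [ev('register request'), ev('decide'), ev('pay compensation')],
    [ev('register request'), ev('decide'), ev('reject request')],
    [ev('check ticket'), ev('decide'), ev('pay compensation')],
]

starts_with_register = keep_by_boundary(toy_log, {'register request'}, 0)
ends_with_payment = keep_by_boundary(toy_log, {'pay compensation'}, -1)
# retain=False inverts the selection: keep the traces that do NOT match.
not_paid = keep_by_boundary(toy_log, {'pay compensation'}, -1, retain=False)
```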

### Event Attribute Values

- ```filter_event_attribute_values(log, attribute_key, values, level="case", retain=True)```
  - retains (or drops) traces (or events) based on a given collection of ```values``` that need to be matched for the
  given ```attribute_key```. If ```level=='case'```, complete traces are retained (or dropped if ```retain==False```) that
  have at least one event that describes a specified value for the given attribute. If ```level=='event'```, only events
  that match are retained (or dropped).

In [87]:
# retain any case that has either Pete or Mike working on it
lf = pm4py.filter_event_attribute_values(log, 'org:resource', {'Pete', 'Mike'})
list(map(lambda t: list(map(lambda e: e['org:resource'], t)), lf))

[['Pete', 'Mike', 'Ellen', 'Sara', 'Sara', 'Sean', 'Pete', 'Sara', 'Ellen'],
 ['Mike', 'Mike', 'Sean', 'Sara', 'Ellen'],
 ['Pete', 'Sue', 'Mike', 'Sara', 'Pete'],
 ['Mike', 'Ellen', 'Mike', 'Sara', 'Mike'],
 ['Ellen',
  'Mike',
  'Pete',
  'Sara',
  'Sara',
  'Ellen',
  'Mike',
  'Sara',
  'Sara',
  'Sue',
  'Pete',
  'Sara',
  'Mike'],
 ['Pete', 'Mike', 'Sean', 'Sara', 'Ellen']]

In [88]:
# retain only those events that have Pete or Mike working on them
lf = pm4py.filter_event_attribute_values(log, 'org:resource', {'Pete', 'Mike'}, level='event')
list(map(lambda t: list(map(lambda e: e['org:resource'], t)), lf))


[['Pete', 'Mike', 'Pete'],
 ['Mike', 'Mike'],
 ['Pete', 'Mike', 'Pete'],
 ['Mike', 'Mike', 'Mike'],
 ['Mike', 'Pete', 'Mike', 'Pete', 'Mike'],
 ['Pete', 'Mike']]
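
The difference between the two levels can be sketched in plain Python. The function ```filter_values``` below is a hypothetical re-implementation of the documented semantics on lists of dictionaries, not pm4py's code. In particular, keeping traces that become empty at the event level is an assumption of this sketch.

```python
def filter_values(log, key, values, level='case', retain=True):
    """Hypothetical sketch of case-level versus event-level attribute filtering."""
    if level == 'case':
        # Keep (or drop) whole traces that contain at least one matching event.
        return [t for t in log if any(e.get(key) in values for e in t) == retain]
    if level == 'event':
        # Keep (or drop) individual matching events inside every trace.
        return [[e for e in t if (e.get(key) in values) == retain] for t in log]
    raise ValueError("level must be 'case' or 'event'")


def resources(log):
    """Project each trace onto its list of resources (illustrative helper)."""
    return [[e['org:resource'] for e in t] for t in log]


toy_log = [
    [{'org:resource': 'Pete'}, {'org:resource': 'Sue'}, {'org:resource': 'Mike'}],
    [{'org:resource': 'Sara'}, {'org:resource': 'Ellen'}],
    [{'org:resource': 'Mike'}, {'org:resource': 'Sara'}],
]

case_level = resources(filter_values(toy_log, 'org:resource', {'Pete', 'Mike'}))
event_level = resources(filter_values(toy_log, 'org:resource', {'Pete', 'Mike'}, level='event'))
case_dropped = resources(filter_values(toy_log, 'org:resource', {'Pete', 'Mike'}, retain=False))
```

At the case level the matching traces are returned unchanged, whereas at the event level every trace is shortened to its matching events.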