# An Introduction to Process Mining: Trade Payables Use Case

Most companies have information systems that record activities of interest, such as the registration of a new customer, the sale of a product, the approval of a purchase, or the processing of a payment. Each of these activities results in one or more events being recorded in an information system. These events are typically used for record-keeping, accounting and auditing.

Process mining uses these recorded events to understand how an organisation actually works. With process mining, the actual sequence of tasks (events) performed can be discovered automatically, revealing the behaviour of the recorded process execution. The actual process can then be compared with the expected behaviour so that deviations are detected. This supports process diagnostics and preventive action against potential risks and fraud. To learn more about process mining, visit XXXXXX.

Trade payables are a company's obligations to pay for goods or services acquired from suppliers in the ordinary course of business. The purchase-to-pay process is recognised as one of the most important processes within a company because it provides the core resources for running the business on a daily basis and strongly influences the overall cost and timing of production. It starts with the filing of a purchase order or request and is completed when the final payment is made to the vendor.

A procurement process carries several inherent risks, such as fictitious transactions being recorded, payments being made without an underlying purchase, and purchases not being properly authorised. Therefore, when auditing trade payables, it is very important to understand how purchase-to-pay transactions are processed and which controls exist in the process, such as segregation of duties.
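A segregation-of-duties check of this kind can be sketched in a few lines of Python. The event log, the user names, the activity names and the rule that creation and approval must be done by different people are all hypothetical, chosen only to illustrate the idea; none of them come from the dataset used later in this post.

```python
from collections import defaultdict

# Hypothetical toy event log: (case_id, activity, user). Illustrative only.
events = [
    ("PO-1", "Create Purchase Order", "alice"),
    ("PO-1", "Approve Purchase Order", "bob"),
    ("PO-1", "Pay Invoice", "carol"),
    ("PO-2", "Create Purchase Order", "dave"),
    ("PO-2", "Approve Purchase Order", "dave"),  # same user creates and approves
    ("PO-2", "Pay Invoice", "carol"),
]

# Assumed control rule: these two activities must be done by different users.
CONFLICTING = ("Create Purchase Order", "Approve Purchase Order")


def segregation_violations(log, conflicting=CONFLICTING):
    """Return the case IDs in which one user performed both conflicting activities."""
    users_by_case = defaultdict(lambda: defaultdict(set))
    for case_id, activity, user in log:
        users_by_case[case_id][activity].add(user)

    first, second = conflicting
    return sorted(
        case_id
        for case_id, by_activity in users_by_case.items()
        if by_activity[first] & by_activity[second]  # non-empty overlap of users
    )


violations = segregation_violations(events)
print(violations)
```

On a real log the rule set would come from the company's control framework, and each flagged case would still need manual follow-up before being treated as a finding.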

In this post, we will look at how process mining can be used to understand a company's purchase-to-pay process: who is responsible for carrying out which tasks, and how tasks are handed over from one employee to another. This post is an introduction to one possible use case of process mining in auditing. Further posts will cover more use cases, including some outside the field of auditing.

This will be done using Python and libraries such as pandas (for analysing the data) and graphviz (for drawing the directly-follows graph that shows the process).
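As a rough illustration of what a directly-follows graph is, the sketch below counts how often one activity is immediately followed by another within the same case and writes the result as DOT text. The toy log and the `Delivery` activity are made up for the example; only the column names `Key`, `Date` and `Activity` mirror the dataset. The DOT string could be handed to `graphviz.Source` for rendering, which is left out here so that the snippet runs without Graphviz installed.

```python
import pandas as pd

# Hypothetical toy event log with three cases (A, B, C).
log = pd.DataFrame({
    "Key": ["A", "A", "A", "B", "B", "C", "C"],
    "Date": pd.to_datetime([
        "2016-01-04 13:46", "2016-01-05 09:00", "2016-01-06 10:00",
        "2016-01-04 14:00", "2016-01-07 08:30",
        "2016-01-05 10:00", "2016-01-09 16:45",
    ]),
    "Activity": [
        "Line Creation", "Delivery", "Good Issue",
        "Line Creation", "Good Issue",
        "Line Creation", "Good Issue",
    ],
})

# Order events within each case, then pair every activity with its successor.
log = log.sort_values(["Key", "Date"])
log["Next"] = log.groupby("Key")["Activity"].shift(-1)

# Count each (activity, next activity) pair; the last event of a case has no successor.
dfg = (
    log.dropna(subset=["Next"])
       .groupby(["Activity", "Next"])
       .size()
       .to_dict()
)

# Write the graph as DOT text, one labelled edge per pair.
lines = ["digraph dfg {"]
for (src, dst), count in sorted(dfg.items()):
    lines.append(f'  "{src}" -> "{dst}" [label="{count}"];')
lines.append("}")
dot_source = "\n".join(lines)
print(dot_source)
```

Edge labels here are raw frequencies; other annotations, such as the average time between two activities, could be computed the same way from the `Date` column.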

The dataset was obtained from IBM's GitHub repository for process mining: https://github.com/IBM/processmining.


The event log is fully IEEE-XES compliant and is structured as follows. The case ID is a combination of the purchase document and the purchase item. There is a total of 76,349 purchase documents containing in total 251,734 items, i.e. there are 251,734 cases. In these cases, there are 1,595,923 events relating to 42 activities performed by 627 users (607 human users and 20 batch users). Sometimes the user field is empty, or NONE, which indicates no user was recorded in the source system.
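Since the case ID combines a document number and an item number, it can be useful to split it back into its two parts. The sketch below assumes the two parts are joined by an underscore, as keys such as `7020029104_20` suggest; the column names `Document` and `Item` are introduced here for illustration and are not part of the dataset.

```python
import pandas as pd

# A few keys in the same shape as the dataset's Key column.
keys = pd.Series(
    ["7020029102_10", "7020029104_10", "7020029104_20", "7020029104_30"],
    name="Key",
)

# Assumption: the text before the first underscore is the document,
# the text after it is the item within that document.
parts = keys.str.split("_", n=1, expand=True)
parts.columns = ["Document", "Item"]  # hypothetical column names

n_documents = parts["Document"].nunique()
n_cases = keys.nunique()
print(n_documents, n_cases)
```

Counting distinct documents and distinct keys separately gives the two figures that a description like the one above reports: the number of documents and the number of cases.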

For each purchase item (or case) the following attributes are recorded:

- 1 Key: The purchase ID,
- 2 Date: The date and time of an event,
- 3 User: The user resource involved in the process,
- 4 Activity: The activity performed in the process,
- 5 Product_hierarchy: A text explaining the hierarchy of a purchase item,
- 6 NetValue: The value of a purchase item,
- 7 Delivery: The delivery ID of this item,
- 8 Delivery_Date: The delivery date of this item,
- 9 Good_Issue_Date: The date the goods were issued (a derived value),
- 10 Difference: The time difference (in seconds) between the delivery date and goods issue date,
- 11 Customer: The customer id,
- 12 OrderType: Type of order,
- 13 clientCode: The client code,
- 14 NotInTime: Indicating if an order was delayed or not where 1 = delayed and 0 = on time,
- 15 Execution_Status: Indicating if it was a manual or automatic task,
- 16 User_Type: Indicating if the task was done by a human or robot,
- 17 Change_Status: Change indicator,
- 18 ID_Change_Status: The change_status ID,
- 19 Block_Status: Block indicator,
- 20 ID_Block_Status: The Block_Status ID.
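The `Date` attribute is what gives the events their order, so it needs to be a real timestamp rather than text. The values in this log appear to be written day first (for example `13/07/2017 23:59:59`, where 13 cannot be a month), so a minimal sketch of the conversion, assuming that format holds for every row, could look like this:

```python
import pandas as pd

# Sample timestamps in the same textual form as the Date column.
raw = pd.Series([
    "04/01/2016 13:46:13",
    "11/07/2017 23:59:59",
    "13/07/2017 23:59:59",
])

# Assumed format: day/month/year followed by a 24-hour time.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S")
print(parsed)
```

Passing an explicit `format` makes the parse fail on values that do not match, which is usually preferable to letting a date such as `04/01/2016` be silently read as 1 April.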


In [1]:
import pandas as pd
import numpy as np
import graphviz
# import matplotlib.pyplot as plt
# import seaborn as sns

In [4]:
df = pd.read_csv("o2c_crypted.csv", thousands='.', decimal=',')
df

Unnamed: 0,Key,Date,User,Activity,Role,Product_hierarchy,NetValue,Company,Delivery,Delivery_Date,...,Delayed,PromiseMAD,ActualMAD,Execution_Status,User_Type,Change_Status,ID_Change_Status,Block_Status,ID_Block_Status,Local_Family_code
0,7020029102_10,04/01/2016 13:46:13,User1,Line Creation,Customer Service Representative,TLC Optical Cables,773.87,767,,,...,IN TIME,1.452726e+12,1.452726e+12,Manual,Human,no change,With change,no block,With block,LocalFamily1
1,7020029103_10,04/01/2016 13:46:55,User1,Line Creation,Customer Service Representative,TLC Optical Cables,706.50,767,,,...,IN TIME,1.482102e+12,1.452812e+12,Manual,Human,no change,With change,no block,With block,LocalFamily2
2,7020029104_10,04/01/2016 13:47:30,User1,Line Creation,Customer Service Representative,TLC Optical Cables,2168.40,767,,,...,IN TIME,1.453417e+12,1.453417e+12,Manual,Human,no change,With change,no block,With block,LocalFamily2
3,7020029104_20,04/01/2016 13:47:38,User1,Line Creation,Customer Service Representative,TLC Optical Cables,1566.60,767,,,...,IN TIME,1.453417e+12,1.453417e+12,Manual,Human,no change,With change,no block,With block,LocalFamily3
4,7020029104_30,04/01/2016 13:47:43,User1,Line Creation,Customer Service Representative,TLC Optical Cables,1106.85,767,,,...,IN TIME,1.482102e+12,1.453417e+12,Manual,Human,no change,With change,no block,With block,LocalFamily2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251473,7020034883_90,11/07/2017 23:59:59,User66,Good Issue,Customer Service Representative,TLC Optical Cables,59994.48,767,7.050145e+09,02/08/17,...,IN TIME,1.500930e+12,1.500590e+12,Manual,Human,no change,With change,no block,With block,LocalFamily9
251474,7020034883_90,11/07/2017 23:59:59,User66,Good Issue,Customer Service Representative,TLC Optical Cables,59994.48,767,7.050145e+09,02/08/17,...,IN TIME,1.500930e+12,1.500590e+12,Manual,Human,no change,With change,no block,With block,LocalFamily9
251475,7020030338_150,12/07/2017 23:59:59,User66,Good Issue,Customer Service Representative,TLC Optical Cables,89136.00,767,7.050145e+09,07/05/18,...,IN TIME,1.525040e+12,1.501020e+12,Manual,Human,no change,With change,no block,With block,LocalFamily13
251476,7020033072_100,13/07/2017 23:59:59,User66,Good Issue,Customer Service Representative,TLC Optical Cables,17013.15,767,7.050145e+09,01/08/17,...,IN TIME,1.500930e+12,1.500930e+12,Manual,Human,no change,With change,no block,With block,LocalFamily8


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251478 entries, 0 to 251477
Data columns (total 26 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Key                251478 non-null  object 
 1   Date               251478 non-null  object 
 2   User               251478 non-null  object 
 3   Activity           251478 non-null  object 
 4   Role               251478 non-null  object 
 5   Product_hierarchy  251478 non-null  object 
 6   NetValue           251478 non-null  float64
 7   Company            251478 non-null  int64  
 8   Delivery           110283 non-null  float64
 9   Delivery_Date      110276 non-null  object 
 10  Good_Issue_Date    251478 non-null  float64
 11  Difference         110283 non-null  float64
 12  Customer           251478 non-null  object 
 13  OrderType          251478 non-null  object 
 14  clientCode         251478 non-null  object 
 15  NotInTime          251478 non-null  int64  
 16  De

In [6]:
df['Key'].nunique()

45825
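The number of distinct keys is the number of cases, and dividing the row count by it gives the average number of events per case. With the figures shown above (251,478 rows and 45,825 distinct keys), that is roughly 5.5 events per case. The sketch below shows the same calculation, plus the per-case breakdown, on a small made-up frame; the activity values other than `Line Creation` and `Good Issue` are hypothetical.

```python
import pandas as pd

# Small made-up event log in the same shape as df (one row per event).
toy = pd.DataFrame({
    "Key": [
        "7020029102_10",
        "7020029104_10", "7020029104_10",
        "7020034883_90", "7020034883_90", "7020034883_90",
    ],
    "Activity": [
        "Line Creation",
        "Line Creation", "Good Issue",
        "Line Creation", "Delivery", "Good Issue",  # "Delivery" is hypothetical
    ],
})

n_cases = toy["Key"].nunique()               # distinct cases
events_per_case = toy.groupby("Key").size()  # number of events in each case
mean_events = len(toy) / n_cases             # average events per case

print(n_cases, mean_events)
print(events_per_case)
```

Applied to the full log, `events_per_case.describe()` would summarise how the case lengths are distributed, which helps to spot unusually long or short cases.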