# Import and Preprocess Data
*Author: Lennart Ebert (mail@lennart-ebert.de)*

This Jupyter notebook helps with importing the datasets used in the master's thesis. Detailed instructions are provided in each section of the notebook.

There are a total of four datasets:
1. The synthetic Maaradji et al. 2013 dataset.
2. The BPIC 2015 dataset.
3. The BPIC 2018 dataset.
4. The generated synthetic datasets with attribute drift.


If all datasets are imported, the resulting folder structure should look like this:

```
data
|-real
|--bpic_2015
|---BPIC15_1.xes
|---...
|--bpic_2018
|---BPI Challenge 2018.xes
|-synthetic
|--maardji et al 2013_mxml
|---logs...
|--maardji et al 2013_xes
|---logs...
|--generated_datasets
|---recurring_3_attribute_values...
|---sudden_3_attribute_values...
|---sudden_10_attribute_values...
```
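A quick way to see which parts of this layout are still missing is sketched below. It is only an illustration: the function name `missing_data_dirs` and the `EXPECTED_DIRS` list are assumptions introduced here (the folder names are copied from the tree above), and the check only looks at folders, not at the individual log files.

```python
from pathlib import Path

# Expected sub-folders, copied from the tree above (names are assumptions
# in the sense that they must match whatever paths you actually use).
EXPECTED_DIRS = [
    "real/bpic_2015",
    "real/bpic_2018",
    "synthetic/maardji et al 2013_mxml",
    "synthetic/maardji et al 2013_xes",
    "synthetic/generated_datasets",
]


def missing_data_dirs(data_root="data"):
    """Return the expected sub-folders that do not exist below data_root."""
    root = Path(data_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


# An empty list means every expected folder is present.
print(missing_data_dirs())
```

Running this after each import step shows which datasets still need to be downloaded or unpacked.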

In [3]:
# imports
import os
import opyenxes
from opyenxes.data_in import XMxmlParser
from opyenxes.data_out import XesXmlSerializer
import helper
import zipfile
import shutil
import gzip

## Synthetic Maaradji et al. 2013 Dataset

The dataset has 72 event logs with different kinds of concept drift.

Instructions:
1. Download from https://data.4tu.nl/articles/dataset/Business_Process_Drift/12712436
2. Place the file at `data/synthetic/maaradji et al 2013.zip` or change the path specified below.
3. Execute cells below.
4. The extracted files will be placed in `data/synthetic/maardji et al 2013_mxml`.

The cells below:
- unpack the data
- convert the data from MXML to XES

In [2]:
# change this path to your download location
zip_path = 'data/synthetic/maaradji et al 2013.zip'

#### 1. Unpack the data

In [3]:
# unpacks the data into the correct folder
unzipped_path = 'data/synthetic/maardji et al 2013_mxml/'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(unzipped_path)

#### 2. Convert to XES

In [4]:
# set the source (MXML) and target (XES) paths
mxml_path = unzipped_path
xes_path = 'data/synthetic/maardji et al 2013_xes/'

In [5]:
# replicate the folder tree that holds the MXML files, converting each MXML file to XES
def all_mxml_to_xes(mxml_path, xes_path):
    # collect all files below mxml_path
    all_files = set()
    for (dirpath, dirnames, filenames) in os.walk(mxml_path):
        all_files.update(os.path.normpath(os.path.join(dirpath, file)) for file in filenames)
    all_mxml_files = {file for file in all_files if file.lower().endswith('.mxml')}
    all_non_mxml_files = all_files - all_mxml_files
    
    # copy all non-MXML files without changing them
    for non_mxml_file in all_non_mxml_files:
        new_path = helper.create_and_get_new_path(non_mxml_file, mxml_path, xes_path)
        
        # copy content and metadata
        shutil.copy2(non_mxml_file, new_path)
    
    # convert all MXML files to XES
    xes_log_file_paths = []
    for mxml_log_path in all_mxml_files:
        new_path = helper.create_and_get_new_path(mxml_log_path, mxml_path, xes_path, new_extension='.xes')
        xes_log_file_paths.append(new_path)
        
        print(new_path)

        # read the MXML log with OpyenXes
        mxml_parser = XMxmlParser.XMxmlParser()
        with open(mxml_log_path) as mxml_log_file:
            parsed_logs = mxml_parser.parse(mxml_log_file)
        
        # each of our MXML files contains exactly one log, so take the first
        parsed_log = parsed_logs[0]
        
        # write the log out as an XES file
        with open(new_path, 'w') as new_file:
            XesXmlSerializer.XesXmlSerializer().serialize(parsed_log, new_file)
    
    return xes_log_file_paths
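
The function above relies on `helper.create_and_get_new_path`, which is defined in the project's `helper` module and is not shown in this notebook. Judging only from how it is called here, it appears to map a file from the source tree onto the same relative location in the target tree, optionally swapping the extension, and to make sure the target folder exists. The following is a hypothetical stand-in written under that assumption; the real helper may behave differently.

```python
import os


def create_and_get_new_path(file_path, old_root, new_root, new_extension=None):
    """Hypothetical stand-in for helper.create_and_get_new_path (assumed behavior)."""
    # path of the file relative to the source root, e.g. "logs/cb/cb5k.mxml"
    relative_path = os.path.relpath(file_path, old_root)

    # optionally replace the file extension, e.g. ".mxml" -> ".xes"
    if new_extension is not None:
        relative_path = os.path.splitext(relative_path)[0] + new_extension

    # same relative location, but below the target root
    new_path = os.path.normpath(os.path.join(new_root, relative_path))

    # create the target folder if it does not exist yet
    os.makedirs(os.path.dirname(new_path), exist_ok=True)
    return new_path
```

Note that `os.path.splitext` only strips the last extension, so a name such as `cb7.5k.mxml` becomes `cb7.5k.xes`, which matches the file names printed by the conversion cell.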

In [6]:
xes_log_file_paths = all_mxml_to_xes(mxml_path, xes_path)

data\synthetic\maardji et al 2013_xes\logs\cm\cm7.5k.xes
data\synthetic\maardji et al 2013_xes\logs\sw\sw10k.xes
data\synthetic\maardji et al 2013_xes\logs\OIR\OIR7.5k.xes
data\synthetic\maardji et al 2013_xes\logs\re\re10k.xes
data\synthetic\maardji et al 2013_xes\logs\pl\pl2.5k.xes
data\synthetic\maardji et al 2013_xes\logs\cb\cb7.5k.xes
data\synthetic\maardji et al 2013_xes\logs\cb\cb5k.xes
data\synthetic\maardji et al 2013_xes\logs\sw\sw2.5k.xes
data\synthetic\maardji et al 2013_xes\logs\RIO\RIO5k.xes
data\synthetic\maardji et al 2013_xes\logs\rp\rp5k.xes
data\synthetic\maardji et al 2013_xes\logs\IOR\IOR5k.xes
data\synthetic\maardji et al 2013_xes\logs\fr\fr7.5k.xes
data\synthetic\maardji et al 2013_xes\logs\cb\cb10k.xes
data\synthetic\maardji et al 2013_xes\logs\ORI\ORI5k.xes
data\synthetic\maardji et al 2013_xes\logs\OIR\OIR5k.xes
data\synthetic\maardji et al 2013_xes\logs\fr\fr10k.xes
data\synthetic\maardji et al 2013_xes\logs\ROI\ROI10k.xes
data\synthetic\maardji et al 2013_xe

In [7]:
xes_log_file_paths

['data\\synthetic\\maardji et al 2013_xes\\logs\\cm\\cm7.5k.xes',
 'data\\synthetic\\maardji et al 2013_xes\\logs\\sw\\sw10k.xes',
 'data\\synthetic\\maardji et al 2013_xes\\logs\\OIR\\OIR7.5k.xes',
 'data\\synthetic\\maardji et al 2013_xes\\logs\\re\\re10k.xes',
 'data\\synthetic\\maardji et al 2013_xes\\logs\\pl\\pl2.5k.xes',
 'data\\synthetic\\maardji et al 2013_xes\\logs\\cb\\cb7.5k.xes',
 'data\\synthetic\\maardji et al 2013_xes\\logs\\cb\\cb5k.xes',
 'data\\synthetic\\maardji et al 2013_xes\\logs\\sw\\sw2.5k.xes',
 'data\\synthetic\\maardji et al 2013_xes\\logs\\RIO\\RIO5k.xes',
 'data\\synthetic\\maardji et al 2013_xes\\logs\\rp\\rp5k.xes',
 'data\\synthetic\\maardji et al 2013_xes\\logs\\IOR\\IOR5k.xes',
 'data\\synthetic\\maardji et al 2013_xes\\logs\\fr\\fr7.5k.xes',
 'data\\synthetic\\maardji et al 2013_xes\\logs\\cb\\cb10k.xes',
 'data\\synthetic\\maardji et al 2013_xes\\logs\\ORI\\ORI5k.xes',
 'data\\synthetic\\maardji et al 2013_xes\\logs\\OIR\\OIR5k.xes',
 'data\\synthet
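
Because the printed output is long, a count is an easier sanity check than reading the list. The sketch below counts files by extension in both folders; `count_files_with_suffix` is a name introduced here for illustration. Since the dataset is described above as containing 72 event logs, both counts would be expected to be 72, assuming the archive holds exactly those logs as MXML files.

```python
import os


def count_files_with_suffix(root, suffix):
    """Count files below root whose name ends with suffix (case-insensitive)."""
    count = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        count += sum(name.lower().endswith(suffix.lower()) for name in filenames)
    return count


# paths as used in the cells above
n_mxml = count_files_with_suffix('data/synthetic/maardji et al 2013_mxml/', '.mxml')
n_xes = count_files_with_suffix('data/synthetic/maardji et al 2013_xes/', '.xes')
print(n_mxml, n_xes)
```

If the two numbers differ, some MXML files were probably not converted, for example because a parse error interrupted the loop.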

## BPIC 2015 Dataset
The dataset consists of five event logs, which must be downloaded individually. The XES event log files are available at https://data.4tu.nl/collections/_/5065424/1.

Place them into the folder `data/real/bpic_2015` as follows:
bpic_2015
|- BPIC15_1.xes
|- BPIC15_2.xes
|- BPIC15_3.xes
|- BPIC15_4.xes
|- BPIC15_5.xes

## BPIC 2018 Dataset
Download the dataset from https://data.4tu.nl/articles/dataset/BPI_Challenge_2018/12688355/1 and place it into the folder `data/real/bpic_2018/`.

Name the file `BPI Challenge 2018.xes.gz` or change the `file_name` variable below.

Run the next cell to unpack the dataset.

In [None]:
file_path = 'data/real/bpic_2018/'
file_name = 'BPI Challenge 2018.xes.gz'

file_out_name = 'BPI Challenge 2018.xes'

with gzip.open(file_path+file_name, 'rb') as f_in:
    with open(file_path+file_out_name, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

## Generated Synthetic Datasets

Either re-run the generation script in the notebook "11_add synthetic attribute data_many_datasets.ipynb" or download the datasets from https://drive.google.com/drive/folders/1dh-a0aUjZt9eGlVYpA5_cEzFtEpQosVp.

The datasets are provided as .zip files. Unpack the files into the folder `data/synthetic/generated_datasets/`.
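
If you prefer to unpack the downloaded archives from Python rather than by hand, a minimal sketch could look as follows. The function name `unpack_all_zips` and the `downloads` folder in the usage comment are assumptions made for illustration. The sketch also assumes that each archive already contains its own top-level dataset folder; if it does not, the contents of several archives would end up mixed in the target folder.

```python
import zipfile
from pathlib import Path


def unpack_all_zips(zip_dir, target_dir='data/synthetic/generated_datasets/'):
    """Extract every .zip file found directly in zip_dir into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    extracted = []
    for zip_file in sorted(Path(zip_dir).glob('*.zip')):
        with zipfile.ZipFile(zip_file, 'r') as zip_ref:
            zip_ref.extractall(target)
        extracted.append(zip_file.name)

    # names of the archives that were unpacked
    return extracted


# Example (hypothetical download folder):
# unpack_all_zips('downloads')
```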