# Synthetic Dataset Generation
*Author: Lennart Ebert (mail@lennart-ebert.de)*

This notebook is a pre-configured attribute generator that creates the dataset used in the thesis.

It also provides pre-configurations for two additional datasets: one with recurring drift and one with 10 attribute values.

All generated datasets have:
- 5 relevant attributes (attributes with a change)
- 5 irrelevant attributes (attributes without a change)
- primary change-points at 9 locations (1/10, 2/10, ..., 9/10 of the way through the log; depends on the base dataset)
- standard_deviation_offset_explain_change_point = 0
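
To make the "9 locations" concrete, here is a minimal sketch of how evenly spaced change-points can be computed for a log of a given size. The function `evenly_spaced_change_points` is a hypothetical name introduced for illustration; it is not the implementation of `helper.get_change_points_maardji_et_al_2013`, although for a log of 10000 traces it yields the same values that the notebook prints further down.

```python
def evenly_spaced_change_points(size, segments=10):
    """Return the indices that split `size` traces into `segments` equal parts.

    Hypothetical helper for illustration only; the notebook itself relies on
    helper.get_change_points_maardji_et_al_2013.
    """
    step = size // segments
    # One change-point at the end of every segment except the last one.
    return [step * k for k in range(1, segments)]


print(evenly_spaced_change_points(10000))
# [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]
```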

Dataset 1: sudden_3_attribute_values (used in the thesis)
- 3 attribute values each
- only sudden drift

Dataset 2: recurring_3_attribute_values
- 3 attribute values each
- only recurring drift

Dataset 3: sudden_10_attribute_values
- 10 attribute values each
- only sudden drift
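
The three configurations differ only in the number of attribute values and the type of drift, so they can be summarized in a single mapping. The sketch below is one possible way to write that down; `DriftConfiguration` and `CONFIGURATIONS` are names introduced here for illustration, while the parameter values are the ones the generation cells of this notebook set (including the string `'reoccurring'` that the notebook passes for recurring drift).

```python
from typing import NamedTuple


class DriftConfiguration(NamedTuple):
    """Hypothetical container for the per-dataset generation parameters."""
    number_attribute_values: int
    type_of_drift: str
    type_of_change: str


# Keys are the configuration names used as output subfolders.
CONFIGURATIONS = {
    'sudden_3_attribute_values': DriftConfiguration(3, 'sudden', 'mixed'),
    'recurring_3_attribute_values': DriftConfiguration(3, 'reoccurring', 'mixed'),
    'sudden_10_attribute_values': DriftConfiguration(10, 'sudden', 'mixed'),
}

for name, config in CONFIGURATIONS.items():
    print(f'{name}: {config.number_attribute_values} values, '
          f'{config.type_of_drift} drift, {config.type_of_change} change')
```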

In [7]:
import helper
import uuid
import os
import numpy as np

Specify the input datasets which are to be augmented with drifting attributes.

In [8]:
input_datasets = [
    # {
    #     'path': 'data\\synthetic\\maardji et al 2013_xes\\logs\\cb\\cb2.5k.xes',
    #     'size': 2500,
    #     'change_points': helper.get_change_points_maardji_et_al_2013(2500)
    # },
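    # {
    #     'path': 'data\\synthetic\\maardji et al 2013_xes\\logs\\cb\\cb5k.xes',
    #     'size': 5000,
    #     'change_points': helper.get_change_points_maardji_et_al_2013(5000)
    # },
    # {
    #     'path': 'data\\synthetic\\maardji et al 2013_xes\\logs\\cb\\cb7.5k.xes',
    #     'size': 7500,
    #     'change_points': helper.get_change_points_maardji_et_al_2013(7500)
    # },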
    {
        'path': 'data\\synthetic\\maardji et al 2013_xes\\logs\\cf\\cf10k.xes',
        'size': 10000,
        'change_points': helper.get_change_points_maardji_et_al_2013(10000)
    }
]
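
The generation cells read the `'path'`, `'size'`, and `'change_points'` keys of every entry, so an entry that lacks one of them only fails once the loop reaches it. A small check run up front can catch this earlier. The function below is a sketch under that assumption; `validate_input_dataset` and the example entry are hypothetical and not part of `helper`.

```python
def validate_input_dataset(dataset):
    """Raise ValueError if an entry cannot be used by the generation loops.

    Hypothetical check for illustration; not part of the helper module.
    """
    missing = {'path', 'size', 'change_points'} - dataset.keys()
    if missing:
        raise ValueError(f'missing keys: {sorted(missing)}')

    change_points = list(dataset['change_points'])
    # Change-points must be unique and listed in increasing order.
    if change_points != sorted(set(change_points)):
        raise ValueError('change points must be strictly increasing')
    # Every change-point must fall strictly inside the log.
    if change_points and not (0 < change_points[0] and change_points[-1] < dataset['size']):
        raise ValueError('change points must lie inside the log')


# Made-up entry that mirrors the structure of the entries above.
example_entry = {'path': 'example.xes', 'size': 10000, 'change_points': [1000, 2000, 3000]}
validate_input_dataset(example_entry)
print('entry is valid')
```

Calling it as `for dataset in input_datasets: validate_input_dataset(dataset)` before the generation cells would surface a malformed entry before any log is serialized.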


In [9]:
count_relevant_attributes = 5
count_irrelevant_attributes = 5

generations_per_dataset = 1

# Each generated file is written to '<output_folder>/<configuration_name>/<size>/<input_file_name>_<UUID>.xes'.
output_folder = 'data\\synthetic\\generated_datasets\\'
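
The output location described above can be sketched as a small function. `build_output_path` is a hypothetical name used only for illustration, and the folder and file names in the usage lines are made up; the path components themselves follow the layout that the generation cells produce.

```python
import os
import uuid


def build_output_path(output_folder, configuration_name, size, input_path):
    """Compose '<output_folder>/<configuration_name>/<size>/<input name>_<UUID>.xes'.

    Hypothetical helper for illustration; the notebook builds the path inline.
    """
    # Input file name without its extension, e.g. 'cf10k' for 'cf10k.xes'.
    base_name = os.path.splitext(os.path.basename(input_path))[0]
    # A fresh UUID keeps repeated generations from overwriting each other.
    file_name = f'{base_name}_{uuid.uuid4()}.xes'
    return os.path.join(output_folder, configuration_name, str(size), file_name)


path = build_output_path('generated_datasets', 'sudden_3_attribute_values',
                         10000, os.path.join('logs', 'cf10k.xes'))
print(path)
```

Note that the paths in this notebook hard-code backslashes as separators. On a non-Windows system `os.path.basename` does not treat a backslash as a separator, so building the input and output paths with `os.path.join` is likely the more portable choice.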

## Generate Dataset 1: sudden_3_attribute_values

In [10]:
number_attribute_values = 3
type_of_drift = 'sudden'
type_of_change = 'mixed'
configuration_name = 'sudden_3_attribute_values'

In [11]:
for dataset in input_datasets:
    print(f'Now working on dataset {dataset}')

    # Input file name without its extension, e.g. 'cf10k' for 'cf10k.xes'.
    dataset_base = os.path.splitext(os.path.basename(dataset['path']))[0]

    for i in range(generations_per_dataset):
        print(f'{i + 1} of {generations_per_dataset} for current dataset')
        # A fresh UUID keeps repeated generations from overwriting each other.
        file_name = f'{dataset_base}_{uuid.uuid4()}.xes'
        output_path = os.path.join(output_folder, configuration_name, str(dataset['size']), file_name)
        helper.add_synthetic_attributes(dataset['path'],
            output_path,
            dataset['change_points'],
            count_relevant_attributes,
            count_irrelevant_attributes,
            number_attribute_values,
            type_of_drift,
            type_of_change)

Now working on dataset {'path': 'data\\synthetic\\maardji et al 2013_xes\\logs\\cf\\cf10k.xes', 'size': 10000, 'change_points': [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]}
1 of 1 for current dataset
Importance: DEBUG
Message: Start serializing log to XES.XML

Importance: DEBUG
Message: finished serializing log (36351.96630859375 msec.)



## Generate Dataset 2: recurring_3_attribute_values

In [12]:
number_attribute_values = 3
type_of_drift = 'reoccurring'
type_of_change = 'mixed'
configuration_name = 'recurring_3_attribute_values'

In [13]:
for dataset in input_datasets:
    print(f'Now working on dataset {dataset}')

    # Input file name without its extension, e.g. 'cf10k' for 'cf10k.xes'.
    dataset_base = os.path.splitext(os.path.basename(dataset['path']))[0]

    for i in range(generations_per_dataset):
        print(f'{i + 1} of {generations_per_dataset} for current dataset')
        # A fresh UUID keeps repeated generations from overwriting each other.
        file_name = f'{dataset_base}_{uuid.uuid4()}.xes'
        output_path = os.path.join(output_folder, configuration_name, str(dataset['size']), file_name)
        helper.add_synthetic_attributes(dataset['path'],
            output_path,
            dataset['change_points'],
            count_relevant_attributes,
            count_irrelevant_attributes,
            number_attribute_values,
            type_of_drift,
            type_of_change)

Now working on dataset {'path': 'data\\synthetic\\maardji et al 2013_xes\\logs\\cf\\cf10k.xes', 'size': 10000, 'change_points': [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]}
1 of 1 for current dataset
Importance: DEBUG
Message: Start serializing log to XES.XML

Importance: DEBUG
Message: finished serializing log (36913.90625 msec.)



## Generate Dataset 3: sudden_10_attribute_values

In [14]:
number_attribute_values = 10
type_of_drift = 'sudden'
type_of_change = 'mixed'
configuration_name = 'sudden_10_attribute_values'

In [15]:
for dataset in input_datasets:
    print(f'Now working on dataset {dataset}')

    # Input file name without its extension, e.g. 'cf10k' for 'cf10k.xes'.
    dataset_base = os.path.splitext(os.path.basename(dataset['path']))[0]

    for i in range(generations_per_dataset):
        print(f'{i + 1} of {generations_per_dataset} for current dataset')
        # A fresh UUID keeps repeated generations from overwriting each other.
        file_name = f'{dataset_base}_{uuid.uuid4()}.xes'
        output_path = os.path.join(output_folder, configuration_name, str(dataset['size']), file_name)
        helper.add_synthetic_attributes(dataset['path'],
            output_path,
            dataset['change_points'],
            count_relevant_attributes,
            count_irrelevant_attributes,
            number_attribute_values,
            type_of_drift,
            type_of_change)

Now working on dataset {'path': 'data\\synthetic\\maardji et al 2013_xes\\logs\\cf\\cf10k.xes', 'size': 10000, 'change_points': [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]}
1 of 1 for current dataset
Importance: DEBUG
Message: Start serializing log to XES.XML

Importance: DEBUG
Message: finished serializing log (37397.5029296875 msec.)

