# Generate many synthetic datasets with observable attribute change

All generated datasets have:
- 5 relevant attributes (attributes with a change)
- 5 irrelevant attributes (attributes without a change)
- primary change points at 9 locations (1/10, 2/10, ..., 9/10 of the log; the exact positions depend on the base dataset)
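The 1/10 spacing of the change points can be sketched as follows. `evenly_spaced_change_points` is a hypothetical stand-in and not the implementation of `helper.get_change_points_maardji_et_al_2013`; it only assumes that the positions are evenly spaced, which matches the positions printed for the 2500-trace log further down.

```python
def evenly_spaced_change_points(size, segments=10):
    """Return the trace indices at 1/10, 2/10, ..., 9/10 of a log.

    Hypothetical helper for illustration only; the notebook itself uses
    helper.get_change_points_maardji_et_al_2013.
    """
    # integer arithmetic keeps the indices exact for sizes such as 7500
    return [size * k // segments for k in range(1, segments)]


print(evenly_spaced_change_points(2500))
# → [250, 500, 750, 1000, 1250, 1500, 1750, 2000, 2250]
```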

Datasets are generated for sizes 2500, 5000, 7500 and 10000 (in the run shown below, only the 2500 log is enabled).

Dataset 1: The simple one
- 3 attribute values each
- only sudden drift
- type of change: new attribute value
- standard_deviation_offset_explain_change_point = 0

Dataset 2: The complex one
- 30 attribute values each
- sudden and re-occurring drift
- type of change: mixed
- standard_deviation_offset_explain_change_point = 10
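To make the two drift types concrete, here is a minimal sketch of how a single categorical attribute could be sampled per trace. It is an illustration under stated assumptions, not the logic of `helper.add_synthetic_attributes`: the function names, the Dirichlet-sampled distributions, and the drift labels `"sudden"` and `"reoccurring"` are all hypothetical, and the type of change is not modelled.

```python
import numpy as np


def segment_distributions(n_values, n_segments, drift, rng):
    """Return one probability vector over attribute values per segment."""
    if drift == "sudden":
        # a fresh distribution after every change point
        return [rng.dirichlet(np.ones(n_values)) for _ in range(n_segments)]
    if drift == "reoccurring":
        # two distributions that alternate, so earlier behaviour returns
        first, second = rng.dirichlet(np.ones(n_values), size=2)
        return [first if i % 2 == 0 else second for i in range(n_segments)]
    raise ValueError(f"unknown drift type: {drift}")


def sample_attribute(size, change_points, n_values, drift, seed=0):
    """Sample one value per trace, switching distribution at each change point."""
    rng = np.random.default_rng(seed)
    bounds = [0, *change_points, size]
    dists = segment_distributions(n_values, len(bounds) - 1, drift, rng)
    values = np.empty(size, dtype=int)
    for dist, start, end in zip(dists, bounds[:-1], bounds[1:]):
        values[start:end] = rng.choice(n_values, size=end - start, p=dist)
    return values


change_points = [250 * k for k in range(1, 10)]
values = sample_attribute(2500, change_points, n_values=3, drift="sudden")
print(values[:10])
```

With 9 change points the log splits into 10 segments; under the reoccurring variant, segments 0, 2, 4, ... share one distribution and segments 1, 3, 5, ... share the other.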

In [6]:
import helper
import uuid
import os
import numpy as np

In [7]:
input_datasets = [
    {
        'path': 'data\\synthetic\\maardji et al 2013_xes\\logs\\cb\\cb2.5k.xes',
        'size': 2500,
        'change_points': helper.get_change_points_maardji_et_al_2013(2500)
    },
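    # {
    #     'path': 'data\\synthetic\\maardji et al 2013_xes\\logs\\cb\\cb5k.xes',
    #     'size': 5000,
    #     'change_points': helper.get_change_points_maardji_et_al_2013(5000)
    # },
    # {
    #     'path': 'data\\synthetic\\maardji et al 2013_xes\\logs\\cb\\cb7.5k.xes',
    #     'size': 7500,
    #     'change_points': helper.get_change_points_maardji_et_al_2013(7500)
    # },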
    # {
    #     'path': 'data\\synthetic\\maardji et al 2013_xes\\logs\\cf\\cf10k.xes',
    #     'size': 10000,
    #     'change_points': helper.get_change_points_maardji_et_al_2013(10000)
    # }
]


In [8]:
count_relevant_attributes = 5
count_irrelevant_attributes = 5

generations_per_dataset = 100

output_folder = 'data\\synthetic\\attribute_drift\\' # the resulting file will be put in the subfolder 'configuration/size/old_file_name_UUID.xes'
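The output layout `configuration/size/old_file_name_UUID.xes` can be sketched in isolation as below. `build_output_path` is a hypothetical helper introduced only for illustration, and the folder and file names in the usage lines are placeholders.

```python
import os
import uuid


def build_output_path(output_folder, configuration_name, size, input_path):
    """Build '<output_folder>/<configuration>/<size>/<old name>_<UUID>.xes'."""
    # drop only the final extension, so 'cb2.5k.xes' becomes 'cb2.5k'
    base = os.path.splitext(os.path.basename(input_path))[0]
    # a random UUID keeps repeated generations from overwriting each other
    file_name = f"{base}_{uuid.uuid4()}.xes"
    return os.path.join(output_folder, configuration_name, str(size), file_name)


path = build_output_path("out", "simple", 2500, "logs/cb2.5k.xes")
print(path)
```

Note that `os.path.basename` splits on backslashes only on Windows, so the backslash-separated paths used in this notebook assume a Windows environment.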

## Generate Dataset 1: The simple one

In [9]:
number_attribute_values = 3
type_of_drift = 'sudden'
type_of_change = 'mixed'  # note: the overview above lists "new attribute value" for this dataset
configuration_name = 'simple'

In [10]:
for dataset in input_datasets:
    print(f'Now working on dataset {dataset}')

    # file name without directory and without the final extension
    dataset_base = os.path.splitext(os.path.basename(dataset['path']))[0]
    
    for i in range(generations_per_dataset):
        print(f'{i + 1} of {generations_per_dataset} for current dataset')
        file_name = f'{dataset_base}_{uuid.uuid4()}.xes'
        output_path = os.path.join(output_folder, configuration_name, str(dataset['size']), file_name)
        helper.add_synthetic_attributes(dataset['path'],
            output_path,
            dataset['change_points'],
            count_relevant_attributes,
            count_irrelevant_attributes,
            number_attribute_values,
            type_of_drift,
            type_of_change)

Now working on dataset {'path': 'data\\synthetic\\maardji et al 2013_xes\\logs\\cb\\cb2.5k.xes', 'size': 2500, 'change_points': [250, 500, 750, 1000, 1250, 1500, 1750, 2000, 2250]}
1 of 100 for current dataset
Importance: DEBUG
Message: Start serializing log to XES.XML

Importance: DEBUG
Message: finished serializing log (9483.192626953125 msec.)

2 of 100 for current dataset
Importance: DEBUG
Message: Start serializing log to XES.XML

Importance: DEBUG
Message: finished serializing log (8013.608154296875 msec.)

3 of 100 for current dataset
Importance: DEBUG
Message: Start serializing log to XES.XML

Importance: DEBUG
Message: finished serializing log (9202.968505859375 msec.)

4 of 100 for current dataset
Importance: DEBUG
Message: Start serializing log to XES.XML

Importance: DEBUG
Message: finished serializing log (8183.5703125 msec.)

5 of 100 for current dataset
Importance: DEBUG
Message: Start serializing log to XES.XML

Importance: DEBUG
Message: finished serializing log (8039.

## Generate Dataset 2: The complex one
- 30 attribute values each
- sudden and re-occurring drift
- type of change: mixed

In [11]:
number_attribute_values = 30
types_of_drift = ['sudden', 'reoccuring']
type_of_change = 'mixed'
configuration_name = 'complex'
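Because the complex configuration draws the drift type at random for every generated file, a seeded generator can make the draws reproducible. The following is a minimal sketch, assuming a fixed seed is acceptable; the seed value and the name `drift_plan` are illustrative and not part of the original notebook.

```python
import numpy as np

types_of_drift = ['sudden', 'reoccuring']  # same values as the cell above
rng = np.random.default_rng(42)  # the seed 42 is an arbitrary assumption

# one draw per generated file, fixed up front so the plan can be logged
drift_plan = rng.choice(types_of_drift, size=100)

# count how often each drift type was drawn
counts = {t: int((drift_plan == t).sum()) for t in types_of_drift}
print(counts)
```

Each entry of `drift_plan` is a NumPy string; wrapping it in `str(...)` before passing it on avoids surprises if the receiving function checks for a plain `str`.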

In [12]:
for dataset in input_datasets:
    print(f'Now working on dataset {dataset}')

    # file name without directory and without the final extension
    dataset_base = os.path.splitext(os.path.basename(dataset['path']))[0]
    
    for i in range(generations_per_dataset):
        print(f'{i + 1} of {generations_per_dataset} for current dataset')
        type_of_drift = np.random.choice(types_of_drift)
        file_name = f'{dataset_base}_{uuid.uuid4()}.xes'
        output_path = os.path.join(output_folder, configuration_name, str(dataset['size']), file_name)
        helper.add_synthetic_attributes(dataset['path'],
            output_path,
            dataset['change_points'],
            count_relevant_attributes,
            count_irrelevant_attributes,
            number_attribute_values,
            type_of_drift,
            type_of_change)

Now working on dataset {'path': 'data\\synthetic\\maardji et al 2013_xes\\logs\\cb\\cb2.5k.xes', 'size': 2500, 'change_points': [250, 500, 750, 1000, 1250, 1500, 1750, 2000, 2250]}
1 of 100 for current dataset
Importance: DEBUG
Message: Start serializing log to XES.XML

Importance: DEBUG
Message: finished serializing log (9304.8505859375 msec.)

2 of 100 for current dataset
Importance: DEBUG
Message: Start serializing log to XES.XML

Importance: DEBUG
Message: finished serializing log (7524.8076171875 msec.)

3 of 100 for current dataset
Importance: DEBUG
Message: Start serializing log to XES.XML

Importance: DEBUG
Message: finished serializing log (7531.823486328125 msec.)

4 of 100 for current dataset
Importance: DEBUG
Message: Start serializing log to XES.XML

Importance: DEBUG
Message: finished serializing log (7545.97705078125 msec.)

5 of 100 for current dataset
Importance: DEBUG
Message: Start serializing log to XES.XML

Importance: DEBUG
Message: finished serializing log (7525.