# Unsupervised discretization
Dataset: clean_pendigits (L)

Updated at: 23 June 22

By: Sam

### About Dataset

- pendigits.tra: training set, 7494 instances
- pendigits.tes: test set, 3498 instances

The first half of the training set was used for actual training, one-fourth for validation, and one-fourth for writer-dependent testing. The test set was used for writer-independent testing and is the actual quality measure.
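
The split described above can be sketched with plain arithmetic. This is only an illustration: the helper name `split_sizes` is hypothetical, and how the odd leftover row is assigned (here, to the writer-dependent test part) is an assumption, since the dataset description does not say.

```python
def split_sizes(n_train):
    """Sizes of the training, validation and writer-dependent test parts."""
    n_fit = n_train // 2              # first half: actual training
    n_val = (n_train - n_fit) // 2    # one quarter: validation
    n_dep = n_train - n_fit - n_val   # remaining quarter: writer-dependent testing
    return n_fit, n_val, n_dep


n_fit, n_val, n_dep = split_sizes(7494)
print(n_fit, n_val, n_dep)  # 3747 1873 1874
```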

Number of Attributes: 16 input + 1 class attribute (10 classes, digits 0-9)

The input vector size is 2×T, two times the number of resampled points. We considered spatial resampling to T=8, 12, and 16 points in our experiments and found that T=8 gave the best trade-off between accuracy and complexity, which yields the 16 input attributes.
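
To make the 2×T layout concrete, here is a rough sketch of spatial resampling: a pen stroke is resampled to T points equally spaced along its path and flattened to (x1, y1, ..., xT, yT). The function name `resample_by_arc_length` and the use of linear interpolation are assumptions for illustration; this is not necessarily the exact procedure used to build the dataset.

```python
import numpy as np


def resample_by_arc_length(points, T=8):
    """Resample a 2-D stroke to T points equally spaced along its length."""
    points = np.asarray(points, dtype=float)
    # Length of each segment between consecutive points
    seg = np.hypot(*np.diff(points, axis=0).T)
    # Cumulative arc length at every original point
    s = np.concatenate([[0.0], np.cumsum(seg)])
    # T target positions, evenly spaced from start to end of the stroke
    targets = np.linspace(0.0, s[-1], T)
    x = np.interp(targets, s, points[:, 0])
    y = np.interp(targets, s, points[:, 1])
    # Interleave to (x1, y1, ..., xT, yT): a vector of size 2*T
    return np.column_stack([x, y]).ravel()


stroke = [(0, 0), (10, 0), (10, 10)]  # toy L-shaped stroke
vec = resample_by_arc_length(stroke, T=8)
print(vec.shape)  # (16,)
```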

All attributes are numeric.

No missing values; the classes are balanced.
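
Both properties can be checked directly on a DataFrame. The sketch below uses a small toy frame so that it runs on its own; the helper name `summarize_quality` is hypothetical, and on the real data it would be called on the frame loaded from `clean_pendigits.csv`.

```python
import pandas as pd


def summarize_quality(df, target="class"):
    """Return total missing cells, per-class counts, and max/min class ratio."""
    n_missing = int(df.isna().sum().sum())
    class_counts = df[target].value_counts().sort_index()
    imbalance = class_counts.max() / class_counts.min()
    return n_missing, class_counts, imbalance


# Toy stand-in for the real dataset
toy = pd.DataFrame({"A1": [0, 47, 100, 12], "class": [0, 1, 0, 1]})
n_missing, class_counts, imbalance = summarize_quality(toy)
print(n_missing)   # 0
print(imbalance)   # 1.0
```

An imbalance ratio close to 1 indicates balanced classes.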

In [1]:
# Load libraries
import pandas as pd
import numpy as np
import time
import timeit

In [2]:
from sklearn.preprocessing import KBinsDiscretizer as kbins # scikit-learn's unsupervised discretizer, an alternative to feature_engine

In [3]:
from feature_engine.discretisation import EqualFrequencyDiscretiser as efd
from feature_engine.discretisation import EqualWidthDiscretiser as ewd

In [4]:
# Load dataset
data = pd.read_csv('clean_pendigits.csv')

In [5]:
data.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16,class
0,47,100,27,81,57,37,26,0,0,23,56,53,100,90,40,98,8
1,0,89,27,100,42,75,29,45,15,15,37,0,69,2,100,6,2
2,0,57,31,68,72,90,100,100,76,75,50,51,28,25,16,0,1
3,0,100,7,92,5,68,19,45,86,34,100,45,74,23,67,0,4
4,0,67,49,83,100,100,81,80,60,60,40,40,33,20,47,0,1


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A1      10992 non-null  int64
 1   A2      10992 non-null  int64
 2   A3      10992 non-null  int64
 3   A4      10992 non-null  int64
 4   A5      10992 non-null  int64
 5   A6      10992 non-null  int64
 6   A7      10992 non-null  int64
 7   A8      10992 non-null  int64
 8   A9      10992 non-null  int64
 9   A10     10992 non-null  int64
 10  A11     10992 non-null  int64
 11  A12     10992 non-null  int64
 12  A13     10992 non-null  int64
 13  A14     10992 non-null  int64
 14  A15     10992 non-null  int64
 15  A16     10992 non-null  int64
 16  class   10992 non-null  int64
dtypes: int64(17)
memory usage: 1.4 MB


In [7]:
# Convert outcome to categorical
data['class'] = pd.Categorical(data['class'])

In [8]:
# get list of numeric attributes to discretize
num_col = data.select_dtypes(include=np.number).columns
num_col = num_col.tolist()

In [9]:
num_col

['A1',
 'A2',
 'A3',
 'A4',
 'A5',
 'A6',
 'A7',
 'A8',
 'A9',
 'A10',
 'A11',
 'A12',
 'A13',
 'A14',
 'A15',
 'A16']
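
The order of the two steps above matters: converting `class` to categorical first is what keeps it out of the numeric column list. A minimal sketch on a toy frame (the values are taken from the first two rows shown earlier):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"A1": [47, 0], "A2": [100, 89], "class": [8, 2]})

# While class is still int64, it is selected as a numeric column
before = toy.select_dtypes(include=np.number).columns.tolist()

# After the conversion, the categorical column is no longer selected
toy["class"] = pd.Categorical(toy["class"])
after = toy.select_dtypes(include=np.number).columns.tolist()

print(before)  # ['A1', 'A2', 'class']
print(after)   # ['A1', 'A2']
```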

## Equal Width Discretization

In [10]:
# Define function. Inputs: dataset, number of bins (k)

def ewd_disc(data, k):
    '''
    Discretize the numeric columns listed in num_col into k equal width bins.

    EqualWidthDiscretiser parameters
    ----------
    bins : int, default=10
        Desired number of equal width intervals / bins.

    variables : list
        The list of numerical variables to transform. If None, the
        discretiser will automatically select all numerical type variables.

    return_object : bool, default=False
        Whether the numbers in the discrete variable should be returned as
        numeric or as object. The decision should be made by the user based on
        whether they would like to treat the variable as numerical or
        categorical in later feature engineering.

    return_boundaries: bool, default=False
        Whether the output should be the interval boundaries. If True, it returns
        the interval boundaries. If False, it returns integers.
    '''
    ## set up the discretisation transformer
    ## (named discretiser so it does not shadow the function name)
    discretiser = ewd(bins=k, variables=num_col, return_boundaries=False)
    ## fit the transformer
    discretiser.fit(data)
    ## transform the data
    data_ewd = discretiser.transform(data)
    ## discretiser.binner_dict_ stores the interval limits identified for each variable
    return data_ewd  # return dataset after discretization
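
As a rough illustration of what equal width binning does, the sketch below splits the range from minimum to maximum into k intervals of the same width and assigns each value a bin index. The function name `equal_width_codes` is hypothetical. Two assumptions are made: intervals are closed on the right, and the attributes span 0 to 100, so the 16 values of the first data row can stand in for a full column. Under these assumptions the result agrees with the first row printed by `data_ewd1.head()`.

```python
import numpy as np


def equal_width_codes(values, k):
    """Assign each value to one of k equal width bins (indices 0..k-1)."""
    values = np.asarray(values, dtype=float)
    # k + 1 equally spaced edges from the minimum to the maximum
    edges = np.linspace(values.min(), values.max(), k + 1)
    # Use interior edges only; right=True keeps a value equal to an edge in the lower bin
    return np.digitize(values, edges[1:-1], right=True)


# First row of the dataset (A1..A16), used here as a sample spanning 0..100
row0 = [47, 100, 27, 81, 57, 37, 26, 0, 0, 23, 56, 53, 100, 90, 40, 98]
codes_k4 = equal_width_codes(row0, 4)
print(codes_k4.tolist())
# [1, 3, 1, 3, 2, 1, 1, 0, 0, 0, 2, 2, 3, 3, 1, 3]
```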

### EWD - Scenario 1: k = 4

In [11]:
# Perform discretization
k = 4
start = time.time() # Starting  time
data_ewd1 = ewd_disc(data, k)
end = time.time()
ewd_t = end - start
print("Discretization time, EWD, k = ", k,":",ewd_t) # Total time execution

Discretization time, EWD, k =  4 : 0.05168318748474121
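
A single `time.time()` measurement of a call this short can be noisy. One possible alternative, sketched here with the standard `timeit` module, repeats the call and keeps the fastest run. The helper name `best_of` is hypothetical and the workload is a stand-in; in this notebook the call would be `best_of(ewd_disc, data, 4)`.

```python
import timeit


def best_of(func, *args, repeats=5):
    """Return the fastest wall-clock time in seconds over several runs of func(*args)."""
    timer = timeit.Timer(lambda: func(*args))
    # number=1 runs the call once per measurement; repeat gives several measurements
    return min(timer.repeat(repeat=repeats, number=1))


# Stand-in workload so the sketch runs without the dataset
elapsed = best_of(sorted, list(range(100_000, 0, -1)))
print(f"best of 5 runs: {elapsed:.6f} s")
```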


In [12]:
# OUTPUT:
data_ewd1.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16,class
0,1,3,1,3,2,1,1,0,0,0,2,2,3,3,1,3,8
1,0,3,1,3,1,2,1,1,0,0,1,0,2,0,3,0,2
2,0,2,1,2,2,3,3,3,3,2,1,2,1,0,0,0,1
3,0,3,0,3,0,2,0,1,3,1,3,1,2,0,2,0,4
4,0,2,1,3,3,3,3,3,2,2,1,1,1,0,1,0,1


In [13]:
## OUTPUT: Check the number of instances in each interval of data_ewd1
# With equal width discretisation, each bin does not necessarily 
# contain the same number of observations.
for col in num_col:
    print(col)
    print(data_ewd1.groupby(col)[col].count())

A1
A1
0    4770
1    2511
2    1514
3    2197
Name: A1, dtype: int64
A2
A2
0      75
1     330
2    2177
3    8410
Name: A2, dtype: int64
A3
A3
0    3370
1    3717
2    2772
3    1133
Name: A3, dtype: int64
A4
A4
0     118
1     594
2    2567
3    7713
Name: A4, dtype: int64
A5
A5
0    3391
1    1841
2    2700
3    3060
Name: A5, dtype: int64
A6
A6
0    1094
1    1789
2    3492
3    4617
Name: A6, dtype: int64
A7
A7
0    2551
1    2546
2    3281
3    2614
Name: A7, dtype: int64
A8
A8
0    2963
1    3488
2    2683
3    1858
Name: A8, dtype: int64
A9
A9
0    2528
1    2141
2    2172
3    4151
Name: A9, dtype: int64
A10
A10
0    4669
1    3262
2    2247
3     814
Name: A10, dtype: int64
A11
A11
0    2877
1    1146
2    1691
3    5278
Name: A11, dtype: int64
A12
A12
0    4908
1    2781
2    2395
3     908
Name: A12, dtype: int64
A13
A13
0     930
1    3915
2    4075
3    2072
Name: A13, dtype: int64
A14
A14
0    5213
1    3175
2     543
3    2061
Name: A14, dtype: int64
A15
A15
0    4723
1

### EWD - Scenario 2: k = 7

In [14]:
# Perform discretization
k = 7
start = time.time() # Starting  time
data_ewd2 = ewd_disc(data, k)
end = time.time()
ewd_t = end - start
print("Discretization time, EWD, k = ", k,":", ewd_t) # Total time execution

Discretization time, EWD, k =  7 : 0.05597114562988281


In [15]:
# OUTPUT:
data_ewd2.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16,class
0,3,6,1,5,3,2,1,0,0,1,3,3,6,6,2,6,8
1,0,6,1,6,2,5,2,3,1,1,2,0,4,0,6,0,2
2,0,3,2,4,5,6,6,6,5,5,3,3,1,1,1,0,1
3,0,6,0,6,0,4,1,3,6,2,6,3,5,1,4,0,4
4,0,4,3,5,6,6,5,5,4,4,2,2,2,1,3,0,1


In [16]:
## OUTPUT: Check the number of instances in each interval of data_ewd2
# With equal width discretisation, each bin does not necessarily 
# contain the same number of observations.
for col in num_col:
    print(col)
    print(data_ewd2.groupby(col)[col].count())

A1
A1
0    3559
1    1540
2    1502
3    1204
4     784
5     685
6    1718
Name: A1, dtype: int64
A2
A2
0      43
1      47
2     152
3     444
4    1298
5    2627
6    6381
Name: A2, dtype: int64
A3
A3
0    2115
1    1644
2    2130
3    2242
4    1441
5     772
6     648
Name: A3, dtype: int64
A4
A4
0      57
1      91
2     259
3     758
4    1531
5    2186
6    6110
Name: A4, dtype: int64
A5
A5
0    2492
1    1112
2     933
3    1370
4    1631
5    1344
6    2110
Name: A5, dtype: int64
A6
A6
0     727
1     478
2     949
3    1491
4    1960
5    2524
6    2863
Name: A6, dtype: int64
A7
A7
0    1841
1     939
2    1412
3    1857
4    1839
5    1402
6    1702
Name: A7, dtype: int64
A8
A8
0    2128
1    1168
2    2150
3    1922
4    1528
5     763
6    1333
Name: A8, dtype: int64
A9
A9
0    1825
1     916
2    1186
3    1311
4    1251
5    1388
6    3115
Name: A9, dtype: int64
A10
A10
0    3506
1    1466
2    1867
3    1708
4    1145
5    1005
6     295
Name: A10, dtype: int64
A11
A11

### EWD - Scenario 3: k = 10

In [17]:
# Perform discretization
k = 10
start = time.time() # Starting time
data_ewd3 = ewd_disc(data, k)
end = time.time()
ewd_t = end - start
print("Discretization time, EWD, k = ", k,":", ewd_t) # Total time execution

Discretization time, EWD, k =  10 : 0.054937124252319336


In [18]:
# OUTPUT:
data_ewd3.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16,class
0,4,9,2,8,5,3,2,0,0,2,5,5,9,8,3,9,8
1,0,8,2,9,4,7,2,4,1,1,3,0,6,0,9,0,2
2,0,5,3,6,7,8,9,9,7,7,4,5,2,2,1,0,1
3,0,9,0,9,0,6,1,4,8,3,9,4,7,2,6,0,4
4,0,6,4,8,9,9,8,7,5,5,3,3,3,1,4,0,1


In [19]:
## OUTPUT: Check the number of instances in each interval of data_ewd3
# With equal width discretisation, each bin does not necessarily 
# contain the same number of observations.
for col in num_col:
    print(col)
    print(data_ewd3.groupby(col)[col].count())

A1
A1
0    3133
1    1083
2    1110
3    1075
4     880
5     712
6     558
7     501
8     468
9    1472
Name: A1, dtype: int64
A2
A2
0      36
1      21
2      44
3     112
4     192
5     440
6    1005
7    1646
8    2327
9    5169
Name: A2, dtype: int64
A3
A3
0    1792
1     973
2    1280
3    1539
4    1503
5    1460
6     955
7     650
8     328
9     512
Name: A3, dtype: int64
A4
A4
0      38
1      45
2      83
3     189
4     357
5     729
6    1120
7    1555
8    1352
9    5524
Name: A4, dtype: int64
A5
A5
0    2185
1     797
2     738
3     685
4     827
5     998
6    1206
7     989
8     846
9    1721
Name: A5, dtype: int64
A6
A6
0     618
1     293
2     402
3     667
4     903
5    1120
6    1426
7    1915
8    1427
9    2221
Name: A6, dtype: int64
A7
A7
0    1560
1     667
2     729
3     998
4    1143
5    1320
6    1354
7    1104
8     756
9    1361
Name: A7, dtype: int64
A8
A8
0    1947
1     572
2    1002
3    1645
4    1285
5    1324
6    1054
7     582
8     477
9

## Equal Frequency Discretization - EFD
- Reference: https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/discretisation/EqualFrequencyDiscretiser.ipynb
- Parameters:
  - q : int, default=10. Desired number of equal frequency intervals / bins. In other words, the number of quantiles into which the variables should be divided.
  - variables : list. The list of numerical variables that will be discretised. If None, the EqualFrequencyDiscretiser() will select all numerical variables.
  - return_object : bool, default=False. Whether the numbers in the discrete variable should be returned as numeric or as object. The decision is made by the user based on whether they would like to treat the variable as numerical or categorical in later feature engineering.
  - return_boundaries : bool, default=False. Whether the output should be the interval boundaries. If True, it returns the interval boundaries. If False, it returns integers.
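
A minimal sketch of equal frequency binning, to show why a variable can end up with fewer than q bins: the bin edges are quantiles, and when many rows share the same value several quantiles coincide. The function name `equal_frequency_codes` is hypothetical, and merging duplicate edges is an assumption about how the discretiser behaves, suggested by the outputs in this notebook where some attributes have fewer intervals than requested.

```python
import numpy as np


def equal_frequency_codes(values, q):
    """Bin values by quantiles; duplicate quantile edges are merged."""
    values = np.asarray(values, dtype=float)
    # q + 1 quantile edges; np.unique drops edges that coincide
    edges = np.unique(np.quantile(values, np.linspace(0, 1, q + 1)))
    # Interior edges only, intervals closed on the right
    codes = np.digitize(values, edges[1:-1], right=True)
    return codes, len(edges) - 1


# More than half of the values sit at the maximum, so three quantiles coincide
skewed = [10, 20, 30, 40, 100, 100, 100, 100, 100, 100]
codes, n_bins = equal_frequency_codes(skewed, 4)
print(n_bins)          # 2
print(codes.tolist())  # [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```

Four bins were requested, but only two distinct intervals remain.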

### Define function efd_disc: inputs are the dataset and the number of intervals (k)

In [20]:
def efd_disc(data, k):
    ## set up the discretisation transformer
    ## (named discretiser so it does not shadow the function name)
    discretiser = efd(q=k, variables=num_col)
    ## fit the transformer
    discretiser.fit(data)
    ## transform the data
    data_efd = discretiser.transform(data)
    ## discretiser.binner_dict_ stores the interval limits identified for each variable
    return data_efd  # return dataset after discretization

### EFD - Scenario 1: k = 4

In [21]:
# Perform discretization
k = 4
start = time.time() # Starting time
data_efd1 = efd_disc(data, k)
end = time.time()
efd_t = end - start
print("Discretization time, EFD, k = ", k,":", efd_t) # Total time execution

Discretization time, EFD, k =  4 : 0.06898617744445801


In [22]:
## OUTPUT: Check the number of instances in each interval
for col in num_col:
    print(col)
    print(data_efd1.groupby(col)[col].count())

A1
A1
0    2791
1    2786
2    2708
3    2707
Name: A1, dtype: int64
A2
A2
0    2754
1    2808
2    5430
Name: A2, dtype: int64
A3
A3
0    2765
1    2819
2    2688
3    2720
Name: A3, dtype: int64
A4
A4
0    2817
1    2766
2    5409
Name: A4, dtype: int64
A5
A5
0    2811
1    2700
2    2738
3    2743
Name: A5, dtype: int64
A6
A6
0    2778
1    2827
2    2651
3    2736
Name: A6, dtype: int64
A7
A7
0    2780
1    2716
2    2770
3    2726
Name: A7, dtype: int64
A8
A8
0    2767
1    2852
2    2672
3    2701
Name: A8, dtype: int64
A9
A9
0    2822
1    2718
2    2747
3    2705
Name: A9, dtype: int64
A10
A10
0    2819
1    2755
2    2729
3    2689
Name: A10, dtype: int64
A11
A11
0    2764
1    2740
2    2793
3    2695
Name: A11, dtype: int64
A12
A12
0    2754
1    2774
2    2733
3    2731
Name: A12, dtype: int64
A13
A13
0    2874
1    2795
2    2597
3    2726
Name: A13, dtype: int64
A14
A14
0    2757
1    2857
2    2634
3    2744
Name: A14, dtype: int64
A15
A15
0    5540
1    5452
Name: A15, 

### EFD - Scenario 2: k = 7

In [23]:
# Perform discretization
k = 7
start = time.time() # Starting time
data_efd2 = efd_disc(data, k)
end = time.time()
efd_t = end - start
print("Discretization time, EFD, k = ", k,":", efd_t) # Total time execution

Discretization time, EFD, k =  7 : 0.08504486083984375


In [24]:
## OUTPUT
data_efd2.info()
## OUTPUT: Check the number of instances in each interval of data_efd2
for col in num_col:
    print(col)
    print(data_efd2.groupby(col)[col].count())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   A1      10992 non-null  int64   
 1   A2      10992 non-null  int64   
 2   A3      10992 non-null  int64   
 3   A4      10992 non-null  int64   
 4   A5      10992 non-null  int64   
 5   A6      10992 non-null  int64   
 6   A7      10992 non-null  int64   
 7   A8      10992 non-null  int64   
 8   A9      10992 non-null  int64   
 9   A10     10992 non-null  int64   
 10  A11     10992 non-null  int64   
 11  A12     10992 non-null  int64   
 12  A13     10992 non-null  int64   
 13  A14     10992 non-null  int64   
 14  A15     10992 non-null  int64   
 15  A16     10992 non-null  int64   
 16  class   10992 non-null  category
dtypes: category(1), int64(16)
memory usage: 1.4 MB
A1
A1
0    3232
1    1538
2    1529
3    1565
4    1570
5    1558
Name: A1, dtype: int64
A2
A2
0    1617
1    1679
2 

### EFD - Scenario 3: k = 10

In [25]:
# Perform discretization
k = 10
start = time.time() # Starting time
data_efd3 = efd_disc(data, k)
end = time.time()
efd_t = end - start
print("Discretization time, EFD, k = ", k,":", efd_t) # Total time execution

Discretization time, EFD, k =  10 : 0.06845211982727051


In [26]:
## OUTPUT
data_efd3.info()
## OUTPUT: Check the number of instances in each interval of data_efd3
for col in num_col:
    print(col)
    print(data_efd3.groupby(col)[col].count())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   A1      10992 non-null  int64   
 1   A2      10992 non-null  int64   
 2   A3      10992 non-null  int64   
 3   A4      10992 non-null  int64   
 4   A5      10992 non-null  int64   
 5   A6      10992 non-null  int64   
 6   A7      10992 non-null  int64   
 7   A8      10992 non-null  int64   
 8   A9      10992 non-null  int64   
 9   A10     10992 non-null  int64   
 10  A11     10992 non-null  int64   
 11  A12     10992 non-null  int64   
 12  A13     10992 non-null  int64   
 13  A14     10992 non-null  int64   
 14  A15     10992 non-null  int64   
 15  A16     10992 non-null  int64   
 16  class   10992 non-null  category
dtypes: category(1), int64(16)
memory usage: 1.4 MB
A1
A1
0    3337
1    1089
2    1151
3    1024
4    1150
5    1044
6    2197
Name: A1, dtype: int64
A2
A2
0    1177
1 

## Fixed Frequency Discretization - FFD

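### Define function ffd_disc: reuse the equal frequency discretiser with the number of bins derived from m
Inputs are the dataset and the interval frequency (m), the desired number of instances per interval. The number of bins is n/m, where n is the number of rows.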

In [27]:
def ffd_disc(data, m):
    n = len(data)  # number of instances
    ## set up the discretisation transformer
    ## (named discretiser so it does not shadow the function name)
    discretiser = efd(q=round(n/m), variables=num_col) # number of bins = n/m, rounded
    ## fit the transformer
    discretiser.fit(data)
    ## transform the data
    data_ffd = discretiser.transform(data)
    ## discretiser.binner_dict_ stores the interval limits identified for each variable
    return data_ffd  # return dataset after discretization
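
For reference, the number of bins requested for each value of m used in the scenarios below follows from `round(n/m)` with n = 10992 rows. The helper name `ffd_bin_count` is hypothetical and only mirrors that expression.

```python
def ffd_bin_count(n, m):
    """Number of bins requested when each bin should hold about m instances."""
    return round(n / m)


n = 10992  # number of rows in the dataset
requested = {m: ffd_bin_count(n, m) for m in (10, 30, 60, 100)}
print(requested)  # {10: 1099, 30: 366, 60: 183, 100: 110}
```

These are requested counts; the number of intervals actually produced for an attribute can be lower when many instances share the same value.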

### FFD - Scenario 1: m = 10

In [28]:
# Perform discretization
m = 10
start = time.time() # Starting time
data_ffd1 = ffd_disc(data, m)
end = time.time()
ffd_t = end - start
print("Discretization time, FFD,  m = ", m, ":", ffd_t) # Total time execution

Discretization time, FFD,  m =  10 : 0.25850510597229004


In [29]:
## OUTPUT
data_ffd1.info()
## OUTPUT: Check the number of instances in each interval
for col in num_col:
    print(col)
    print(data_ffd1.groupby(col)[col].count())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   A1      10992 non-null  int64   
 1   A2      10992 non-null  int64   
 2   A3      10992 non-null  int64   
 3   A4      10992 non-null  int64   
 4   A5      10992 non-null  int64   
 5   A6      10992 non-null  int64   
 6   A7      10992 non-null  int64   
 7   A8      10992 non-null  int64   
 8   A9      10992 non-null  int64   
 9   A10     10992 non-null  int64   
 10  A11     10992 non-null  int64   
 11  A12     10992 non-null  int64   
 12  A13     10992 non-null  int64   
 13  A14     10992 non-null  int64   
 14  A15     10992 non-null  int64   
 15  A16     10992 non-null  int64   
 16  class   10992 non-null  category
dtypes: category(1), int64(16)
memory usage: 1.4 MB
A1
A1
0      2423
1        85
2        57
3        79
4        69
       ... 
107      35
108      45
109      34
110

### FFD - Scenario 2: m = 30

In [30]:
# Perform discretization
m = 30
start = time.time() # Starting time
data_ffd2 = ffd_disc(data, m)
end = time.time()
ffd_t = end - start
print("Discretization time, FFD, m = ", m, ":", ffd_t) # Total time execution

Discretization time, FFD, m =  30 : 0.13798117637634277


In [31]:
## OUTPUT
data_ffd2.info()
## OUTPUT: Check the number of instances in each interval
for col in num_col:
    print(col)
    print(data_ffd2.groupby(col)[col].count())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   A1      10992 non-null  int64   
 1   A2      10992 non-null  int64   
 2   A3      10992 non-null  int64   
 3   A4      10992 non-null  int64   
 4   A5      10992 non-null  int64   
 5   A6      10992 non-null  int64   
 6   A7      10992 non-null  int64   
 7   A8      10992 non-null  int64   
 8   A9      10992 non-null  int64   
 9   A10     10992 non-null  int64   
 10  A11     10992 non-null  int64   
 11  A12     10992 non-null  int64   
 12  A13     10992 non-null  int64   
 13  A14     10992 non-null  int64   
 14  A15     10992 non-null  int64   
 15  A16     10992 non-null  int64   
 16  class   10992 non-null  category
dtypes: category(1), int64(16)
memory usage: 1.4 MB
A1
A1
0      2423
1        85
2        57
3        79
4        69
       ... 
101      35
102      45
103      34
104

### FFD - Scenario 3: m = 60

In [32]:
# Perform discretization
m = 60
start = time.time() # Starting time
data_ffd3 = ffd_disc(data, m)
end = time.time()
ffd_t = end - start
print("Discretization time, FFD, m = ", m, ":", ffd_t) # Total time execution

Discretization time, FFD, m =  60 : 0.12297821044921875


In [33]:
## OUTPUT
data_ffd3.info()
## OUTPUT: Check the number of instances in each interval
for col in num_col:
    print(col)
    print(data_ffd3.groupby(col)[col].count())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   A1      10992 non-null  int64   
 1   A2      10992 non-null  int64   
 2   A3      10992 non-null  int64   
 3   A4      10992 non-null  int64   
 4   A5      10992 non-null  int64   
 5   A6      10992 non-null  int64   
 6   A7      10992 non-null  int64   
 7   A8      10992 non-null  int64   
 8   A9      10992 non-null  int64   
 9   A10     10992 non-null  int64   
 10  A11     10992 non-null  int64   
 11  A12     10992 non-null  int64   
 12  A13     10992 non-null  int64   
 13  A14     10992 non-null  int64   
 14  A15     10992 non-null  int64   
 15  A16     10992 non-null  int64   
 16  class   10992 non-null  category
dtypes: category(1), int64(16)
memory usage: 1.4 MB
A1
A1
0     2423
1       85
2       57
3       79
4       69
      ... 
87      77
88      74
89      35
90      79
9

### FFD - Scenario 4: m = 100

In [34]:
# Perform discretization
m = 100
start = time.time() # Starting time
data_ffd4 = ffd_disc(data, m)
end = time.time()
ffd_t = end - start
print("Discretization time, FFD, m = ", m, ":", ffd_t) # Total time execution

Discretization time, FFD, m =  100 : 0.09906888008117676


In [35]:
## OUTPUT
data_ffd4.info()

## OUTPUT: Check the number of instances in each interval
for col in num_col:
    print(col)
    print(data_ffd4.groupby(col)[col].count())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   A1      10992 non-null  int64   
 1   A2      10992 non-null  int64   
 2   A3      10992 non-null  int64   
 3   A4      10992 non-null  int64   
 4   A5      10992 non-null  int64   
 5   A6      10992 non-null  int64   
 6   A7      10992 non-null  int64   
 7   A8      10992 non-null  int64   
 8   A9      10992 non-null  int64   
 9   A10     10992 non-null  int64   
 10  A11     10992 non-null  int64   
 11  A12     10992 non-null  int64   
 12  A13     10992 non-null  int64   
 13  A14     10992 non-null  int64   
 14  A15     10992 non-null  int64   
 15  A16     10992 non-null  int64   
 16  class   10992 non-null  category
dtypes: category(1), int64(16)
memory usage: 1.4 MB
A1
A1
0     2423
1       85
2      136
3       69
4      169
      ... 
66      86
67     109
68      74
69     114
7

### Export discretized datasets

In [36]:
# EWD datasets:
data_ewd1.to_csv('pendigits_ewd1.csv', index=False) # k=4
data_ewd2.to_csv('pendigits_ewd2.csv', index=False) # k=7
data_ewd3.to_csv('pendigits_ewd3.csv', index=False) # k=10

In [37]:
# EFD datasets:
data_efd1.to_csv('pendigits_efd1.csv', index=False) # k=4
data_efd2.to_csv('pendigits_efd2.csv', index=False) # k=7
data_efd3.to_csv('pendigits_efd3.csv', index=False) # k=10


In [38]:
# FFD datasets:
data_ffd1.to_csv('pendigits_ffd1.csv', index=False) # m=10
data_ffd2.to_csv('pendigits_ffd2.csv', index=False) # m=30
data_ffd3.to_csv('pendigits_ffd3.csv', index=False) # m=60
data_ffd4.to_csv('pendigits_ffd4.csv', index=False) # m=100
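
One thing to keep in mind when reading these files back: CSV does not store dtypes, so the categorical `class` column comes back as a plain integer column and the conversion has to be applied again. A small sketch using an in-memory buffer in place of a file (the toy frame is illustrative only):

```python
import io

import pandas as pd

frame = pd.DataFrame({"A1": [1, 0, 3], "class": pd.Categorical([8, 2, 1])})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
dtype_after_read = str(restored["class"].dtype)
print(dtype_after_read)  # int64

# Re-apply the conversion used earlier in this notebook
restored["class"] = pd.Categorical(restored["class"])
dtype_after_fix = str(restored["class"].dtype)
print(dtype_after_fix)  # category
```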