## Discretization
Discretization, also called binning, converts **numeric** attributes into **categorical** attributes. It is typically used with data mining methods/models that cannot handle numeric attributes.

# Assignment 2
Perform discretization using *equal width* and *equal frequency* binning.

In [4]:
import pandas as pd
import numpy as np

In [5]:
#source data
dataset_url = "https://raw.githubusercontent.com/indyrajanuar/datamining/main/IRIS.csv"
#create dataframe
df = pd.read_csv(dataset_url)

In [6]:
#show first 15 rows
df.head(15)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


In [9]:
# Constant series, one per numeric column
SEPAL_LENGTH_SERIES = df["sepal_length"]
SEPAL_WIDTH_SERIES = df["sepal_width"]
PETAL_LENGTH_SERIES = df["petal_length"]
PETAL_WIDTH_SERIES = df["petal_width"]

# Class Species Name 
IRIS_SPECIES = df["species"]

# Equal-Width Intervals
The simplest discretization approach is to divide the range of *X* into *k* intervals of equal width.
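
As a minimal sketch of the idea, each interval has width (max - min) / k, so the k + 1 bin edges are evenly spaced between the minimum and maximum of *X*. The helper name `equal_width_edges` and the sample values are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

def equal_width_edges(values, k):
    """Return k + 1 evenly spaced edges covering [min(values), max(values)]."""
    lo, hi = float(np.min(values)), float(np.max(values))
    # linspace includes both endpoints, so consecutive edges are (hi - lo) / k apart
    return np.linspace(lo, hi, k + 1)

# made-up sample values, not taken from the Iris data
edges = equal_width_edges([2.0, 2.9, 3.5, 4.4], k=3)
print(edges)
```

Here the width is (4.4 - 2.0) / 3 = 0.8, so the edges fall at 2.0, 2.8, 3.6 and 4.4 (up to floating-point rounding).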

## Cut
One of the tools in pandas is `cut`, which is used to compute equal-width bins.
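
The following sketch shows the call pattern on a tiny made-up series, assuming three bins and the same label names used later in this notebook. Passing `bins`, `right` and `labels` as keywords is a readability choice; the sample values are hypothetical.

```python
import pandas as pd

# made-up sample values, not taken from the Iris data
widths = pd.Series([2.0, 2.5, 3.0, 3.5, 4.4], name="sepal_width")
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]

# with labels: each value is mapped to a named category
binned = pd.cut(widths, bins=len(labels), right=True, labels=labels)

# without labels: each value is mapped to its interval, e.g. (2.8, 3.6]
intervals = pd.cut(widths, bins=len(labels), right=True)

print(pd.concat((widths, binned.rename("category"), intervals.rename("interval")), axis=1))
```

With `right=True`, each interval is closed on the right, so a value equal to an edge falls into the lower bin.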




### Sepal Width

In [13]:
# equal-width intervals

labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

sepal_width_ew_binning = pd.cut(SEPAL_WIDTH_SERIES, bins=amount_of_binning, right=True, labels=labels)
labelled_sepal_width_ew_binning = sepal_width_ew_binning.value_counts()
interval_sepal_width_ew_binning = pd.cut(SEPAL_WIDTH_SERIES, bins=amount_of_binning, right=True).value_counts()

In [27]:
# dataframe of sepal-width and sepal category
df_sepal_width_ew = pd.DataFrame(pd.concat((SEPAL_WIDTH_SERIES, sepal_width_ew_binning, IRIS_SPECIES), axis=1))

In [29]:
# change columns name
df_sepal_width_ew.columns = ["sepal.width", "category", "species"]

In [30]:
df_sepal_width_ew

Unnamed: 0,sepal.width,category,species
0,3.5,lebar,Iris-setosa
1,3.0,lebar,Iris-setosa
2,3.2,lebar,Iris-setosa
3,3.1,lebar,Iris-setosa
4,3.6,lebar,Iris-setosa
...,...,...,...
145,3.0,lebar,Iris-virginica
146,2.5,sedikit_lebar,Iris-virginica
147,3.0,lebar,Iris-virginica
148,3.4,lebar,Iris-virginica


In [17]:
# equal-width intervals binning with label
labelled_sepal_width_ew_binning

lebar            88
sedikit_lebar    47
sangat_lebar     15
Name: sepal_width, dtype: int64

In [35]:
# equal-width intervals without label
interval_sepal_width_ew_binning

(2.8, 3.6]      88
(1.998, 2.8]    47
(3.6, 4.4]      15
Name: sepal_width, dtype: int64
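
The lowest interval in the output above starts at 1.998, slightly below the data. The pandas documentation for `cut` notes that when `bins` is an integer, the range is extended by 0.1% so that the minimum and maximum are included. The sketch below checks that behavior on made-up values spanning 2.0 to 4.4; `retbins=True` returns the unrounded edges.

```python
import pandas as pd

# made-up values spanning 2.0 to 4.4 (an assumption for illustration)
widths = pd.Series([2.0, 2.8, 3.6, 4.4])

# retbins=True also returns the numeric bin edges
_, edges = pd.cut(widths, bins=3, right=True, retbins=True)

lowest = edges[0]
# 0.1% of the range (4.4 - 2.0) subtracted from the minimum
expected = 2.0 - 0.001 * (4.4 - 2.0)
print(lowest, expected)
```

The unrounded edge is about 1.9976, which the default `precision=3` displays as 1.998.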

### Petal Width

In [18]:
# equal-width intervals
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

petal_width_ew_binning = pd.cut(PETAL_WIDTH_SERIES, bins=amount_of_binning, right=True, labels=labels)
labelled_petal_width_ew_binning = petal_width_ew_binning.value_counts()
interval_petal_width_ew_binning = pd.cut(PETAL_WIDTH_SERIES, bins=amount_of_binning, right=True).value_counts()

In [31]:
# dataframe of petal-width and petal category
df_petal_width = pd.DataFrame(pd.concat((PETAL_WIDTH_SERIES, petal_width_ew_binning, IRIS_SPECIES), axis=1))

In [32]:
# change columns name
df_petal_width.columns = ["petal.width", "category", "species"]

In [33]:
df_petal_width

Unnamed: 0,petal.width,category,species
0,0.2,sedikit_lebar,Iris-setosa
1,0.2,sedikit_lebar,Iris-setosa
2,0.2,sedikit_lebar,Iris-setosa
3,0.2,sedikit_lebar,Iris-setosa
4,0.2,sedikit_lebar,Iris-setosa
...,...,...,...
145,2.3,sangat_lebar,Iris-virginica
146,1.9,sangat_lebar,Iris-virginica
147,2.0,sangat_lebar,Iris-virginica
148,2.3,sangat_lebar,Iris-virginica


In [34]:
# equal-width intervals with label
labelled_petal_width_ew_binning

lebar            54
sedikit_lebar    50
sangat_lebar     46
Name: petal_width, dtype: int64

In [36]:
# equal-width intervals without label
interval_petal_width_ew_binning

(0.9, 1.7]       54
(0.0976, 0.9]    50
(1.7, 2.5]       46
Name: petal_width, dtype: int64

### Sepal Length

In [37]:
# equal-width intervals

labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

sepal_length_ew_binning = pd.cut(SEPAL_LENGTH_SERIES, bins=amount_of_binning, right=True, labels=labels)
labelled_sepal_length_ew_binning = sepal_length_ew_binning.value_counts()
interval_sepal_length_ew_binning = pd.cut(SEPAL_LENGTH_SERIES, bins=amount_of_binning, right=True).value_counts()

In [38]:
# dataframe of sepal-length and sepal category
df_sepal_length_ew = pd.DataFrame(pd.concat((SEPAL_LENGTH_SERIES, sepal_length_ew_binning, IRIS_SPECIES), axis=1))

In [40]:
# change columns name
df_sepal_length_ew.columns = ["sepal_length", "category", "species"]

In [41]:
df_sepal_length_ew

Unnamed: 0,sepal_length,category,species
0,5.1,sedikit_lebar,Iris-setosa
1,4.9,sedikit_lebar,Iris-setosa
2,4.7,sedikit_lebar,Iris-setosa
3,4.6,sedikit_lebar,Iris-setosa
4,5.0,sedikit_lebar,Iris-setosa
...,...,...,...
145,6.7,lebar,Iris-virginica
146,6.3,lebar,Iris-virginica
147,6.5,lebar,Iris-virginica
148,6.2,lebar,Iris-virginica


In [42]:
# equal-width intervals with label
labelled_sepal_length_ew_binning

lebar            71
sedikit_lebar    59
sangat_lebar     20
Name: sepal_length, dtype: int64

In [43]:
# equal-width intervals without label
interval_sepal_length_ew_binning

(5.5, 6.7]      71
(4.296, 5.5]    59
(6.7, 7.9]      20
Name: sepal_length, dtype: int64

### Petal Length

In [44]:
# equal-width intervals
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

petal_length_ew_binning = pd.cut(PETAL_LENGTH_SERIES, bins=amount_of_binning, right=True, labels=labels)
labelled_petal_length_ew_binning = petal_length_ew_binning.value_counts()
interval_petal_length_ew_binning = pd.cut(PETAL_LENGTH_SERIES, bins=amount_of_binning, right=True).value_counts()

In [46]:
# dataframe of petal-length and petal category
df_petal_length_ew = pd.DataFrame(pd.concat((PETAL_LENGTH_SERIES, petal_length_ew_binning, IRIS_SPECIES), axis=1))

In [48]:
# change columns name
df_petal_length_ew.columns = ["petal_length", "category", "species"]

In [49]:
df_petal_length_ew

Unnamed: 0,petal_length,category,species
0,1.4,sedikit_lebar,Iris-setosa
1,1.4,sedikit_lebar,Iris-setosa
2,1.3,sedikit_lebar,Iris-setosa
3,1.5,sedikit_lebar,Iris-setosa
4,1.4,sedikit_lebar,Iris-setosa
...,...,...,...
145,5.2,sangat_lebar,Iris-virginica
146,5.0,sangat_lebar,Iris-virginica
147,5.2,sangat_lebar,Iris-virginica
148,5.4,sangat_lebar,Iris-virginica


In [50]:
# equal-width intervals binning with label
labelled_petal_length_ew_binning

lebar            54
sedikit_lebar    50
sangat_lebar     46
Name: petal_length, dtype: int64

In [51]:
# equal-width intervals without label
interval_petal_length_ew_binning

(2.967, 4.933]    54
(0.994, 2.967]    50
(4.933, 6.9]      46
Name: petal_length, dtype: int64
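
The four equal-width subsections above repeat the same steps for each column. As a sketch only, those steps could be wrapped in one helper; the function name `bin_equal_width` and the small sample frame are assumptions for illustration and do not appear in the notebook.

```python
import pandas as pd

def bin_equal_width(df, column, labels, class_column="species"):
    """Return a frame holding the raw column, its equal-width category, and the class."""
    category = pd.cut(df[column], bins=len(labels), right=True, labels=labels)
    return pd.DataFrame({
        column: df[column],
        "category": category,
        class_column: df[class_column],
    })

# made-up rows standing in for the Iris data
sample = pd.DataFrame({
    "petal_length": [1.0, 3.0, 4.0, 6.9],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica"],
})
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
result = bin_equal_width(sample, "petal_length", labels)
print(result)
```

With the real data, the same helper could be called once per numeric column of `df`.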

# Equal-Frequency Intervals
In equal-frequency discretization, we divide the range of *X* into intervals that each contain the same (or nearly the same) number of data points. Exactly equal frequencies may not be achievable because some values are repeated.
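
One way to picture this is to place the bin edges at evenly spaced quantiles of the data. The sketch below does that with `numpy.quantile`; the helper name `equal_frequency_edges` and the sample values are assumptions made for illustration.

```python
import numpy as np

def equal_frequency_edges(values, k):
    """Return k + 1 edges placed at the 0, 1/k, 2/k, ..., 1 quantiles of values."""
    return np.quantile(values, np.linspace(0, 1, k + 1))

# made-up sample with no repeated values
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
edges = equal_frequency_edges(values, k=3)

# count how many values fall at or below the first inner edge
in_first_bin = sum(v <= edges[1] for v in values)
print(edges, in_first_bin)
```

With nine distinct values and three bins, each bin receives three values; repeated values can break this balance.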

## Qcut
Another pandas tool, `qcut`, is used to compute equal-frequency bins.
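
The sketch below applies `qcut` to a small made-up series that contains a repeated value, to show why the bin counts can end up only approximately equal. The sample values are hypothetical; the label names follow the ones used in this notebook.

```python
import pandas as pd

# made-up sample in which the value 2 is repeated three times
values = pd.Series([1, 2, 2, 2, 3, 4, 5, 6, 7])
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]

# q is the number of quantile-based bins
binned = pd.qcut(values, q=len(labels), labels=labels)
counts = binned.value_counts()
print(counts)
```

All copies of a repeated value must land in the same bin, so the first bin ends up larger than the others.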

### Sepal Width

In [52]:
# equal-frequency intervals

labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

sepal_width_ef_binning = pd.qcut(SEPAL_WIDTH_SERIES, q=amount_of_binning, labels=labels)
labelled_sepal_width_ef_binning = sepal_width_ef_binning.value_counts()
interval_sepal_width_ef_binning = pd.qcut(SEPAL_WIDTH_SERIES, q=amount_of_binning).value_counts()

In [53]:
# dataframe of sepal-width and sepal category
df_sepal_width_ef = pd.DataFrame(pd.concat((SEPAL_WIDTH_SERIES, sepal_width_ef_binning, IRIS_SPECIES), axis = 1))

In [55]:
# change columns name
df_sepal_width_ef.columns = ["sepal_width", "category", "species"]

In [56]:
df_sepal_width_ef

Unnamed: 0,sepal_width,category,species
0,3.5,sangat_lebar,Iris-setosa
1,3.0,lebar,Iris-setosa
2,3.2,lebar,Iris-setosa
3,3.1,lebar,Iris-setosa
4,3.6,sangat_lebar,Iris-setosa
...,...,...,...
145,3.0,lebar,Iris-virginica
146,2.5,sedikit_lebar,Iris-virginica
147,3.0,lebar,Iris-virginica
148,3.4,sangat_lebar,Iris-virginica


In [57]:
# equal-frequency intervals binning with label
labelled_sepal_width_ef_binning

sedikit_lebar    57
lebar            51
sangat_lebar     42
Name: sepal_width, dtype: int64

In [58]:
# equal-frequency intervals without label
interval_sepal_width_ef_binning

(1.999, 2.9]    57
(2.9, 3.2]      51
(3.2, 4.4]      42
Name: sepal_width, dtype: int64

### Petal Width

In [59]:
# equal-frequency intervals

labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

petal_width_ef_binning = pd.qcut(PETAL_WIDTH_SERIES, q=amount_of_binning, labels=labels)
labelled_petal_width_ef_binning = petal_width_ef_binning.value_counts()
interval_petal_width_ef_binning = pd.qcut(PETAL_WIDTH_SERIES, q=amount_of_binning).value_counts()

In [60]:
# dataframe of petal-width and petal category
df_petal_width_ef = pd.DataFrame(pd.concat((PETAL_WIDTH_SERIES, petal_width_ef_binning, IRIS_SPECIES), axis = 1))

In [61]:
# change columns name
df_petal_width_ef.columns = ["petal_width", "category", "species"]

In [62]:
df_petal_width_ef

Unnamed: 0,petal_width,category,species
0,0.2,sedikit_lebar,Iris-setosa
1,0.2,sedikit_lebar,Iris-setosa
2,0.2,sedikit_lebar,Iris-setosa
3,0.2,sedikit_lebar,Iris-setosa
4,0.2,sedikit_lebar,Iris-setosa
...,...,...,...
145,2.3,sangat_lebar,Iris-virginica
146,1.9,sangat_lebar,Iris-virginica
147,2.0,sangat_lebar,Iris-virginica
148,2.3,sangat_lebar,Iris-virginica


In [63]:
# equal-frequency intervals binning with label
labelled_petal_width_ef_binning

lebar            52
sedikit_lebar    50
sangat_lebar     48
Name: petal_width, dtype: int64

In [64]:
# equal-frequency intervals without label
interval_petal_width_ef_binning

(0.867, 1.6]      52
(0.099, 0.867]    50
(1.6, 2.5]        48
Name: petal_width, dtype: int64

### Sepal Length

In [65]:
# equal-frequency intervals

labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

sepal_length_ef_binning = pd.qcut(SEPAL_LENGTH_SERIES, q=amount_of_binning, labels=labels)
labelled_sepal_length_ef_binning = sepal_length_ef_binning.value_counts()
interval_sepal_length_ef_binning = pd.qcut(SEPAL_LENGTH_SERIES, q=amount_of_binning).value_counts()

In [66]:
# dataframe of sepal-length and sepal category
df_sepal_length_ef = pd.DataFrame(pd.concat((SEPAL_LENGTH_SERIES, sepal_length_ef_binning, IRIS_SPECIES), axis=1))

In [67]:
# change columns name
df_sepal_length_ef.columns = ["sepal_length", "category", "species"]

In [68]:
df_sepal_length_ef

Unnamed: 0,sepal_length,category,species
0,5.1,sedikit_lebar,Iris-setosa
1,4.9,sedikit_lebar,Iris-setosa
2,4.7,sedikit_lebar,Iris-setosa
3,4.6,sedikit_lebar,Iris-setosa
4,5.0,sedikit_lebar,Iris-setosa
...,...,...,...
145,6.7,sangat_lebar,Iris-virginica
146,6.3,lebar,Iris-virginica
147,6.5,sangat_lebar,Iris-virginica
148,6.2,lebar,Iris-virginica


In [69]:
# equal-frequency intervals binning with label
labelled_sepal_length_ef_binning

lebar            56
sedikit_lebar    52
sangat_lebar     42
Name: sepal_length, dtype: int64

In [70]:
# equal-frequency intervals without label
interval_sepal_length_ef_binning

(5.4, 6.3]                   56
(4.2989999999999995, 5.4]    52
(6.3, 7.9]                   42
Name: sepal_length, dtype: int64

### Petal Length

In [71]:
# equal-frequency intervals

labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

petal_length_ef_binning = pd.qcut(PETAL_LENGTH_SERIES, q=amount_of_binning, labels=labels)
labelled_petal_length_ef_binning = petal_length_ef_binning.value_counts()
interval_petal_length_ef_binning = pd.qcut(PETAL_LENGTH_SERIES, q=amount_of_binning).value_counts()

In [72]:
# dataframe of petal-length and petal category
df_petal_length_ef = pd.DataFrame(pd.concat((PETAL_LENGTH_SERIES, petal_length_ef_binning, IRIS_SPECIES), axis=1))

In [73]:
# change columns name
df_petal_length_ef.columns = ["petal_length", "category", "species"]

In [74]:
df_petal_length_ef

Unnamed: 0,petal_length,category,species
0,1.4,sedikit_lebar,Iris-setosa
1,1.4,sedikit_lebar,Iris-setosa
2,1.3,sedikit_lebar,Iris-setosa
3,1.5,sedikit_lebar,Iris-setosa
4,1.4,sedikit_lebar,Iris-setosa
...,...,...,...
145,5.2,sangat_lebar,Iris-virginica
146,5.0,sangat_lebar,Iris-virginica
147,5.2,sangat_lebar,Iris-virginica
148,5.4,sangat_lebar,Iris-virginica


In [75]:
# equal-frequency intervals binning with label
labelled_petal_length_ef_binning

lebar            54
sedikit_lebar    50
sangat_lebar     46
Name: petal_length, dtype: int64

In [76]:
# equal-frequency intervals without label
interval_petal_length_ef_binning

(2.633, 4.9]      54
(0.999, 2.633]    50
(4.9, 6.9]        46
Name: petal_length, dtype: int64