# Feature Engineering
Feature engineering is the process of using domain knowledge to select, modify, or create features from raw data in order to improve the performance of a machine learning model. Well-designed features help the model recognize patterns and make more accurate predictions.
## Goals of Feature Engineering
- Improve the performance of machine learning models.
- Reduce model complexity.
- Improve model interpretability.
## Feature Engineering Techniques
1. **Feature Creation**: Building new features from existing ones, for example by combining several features or applying a mathematical transformation.
2. **Feature Selection**: Keeping the most relevant features and dropping those that are unimportant or redundant.
3. **Feature Transformation**: Changing the scale of features, for example through normalization or standardization.
4. **Categorical Encoding**: Converting categorical data into a numeric format that machine learning models can use, such as one-hot encoding or label encoding.
5. **Missing Value Handling**: Filling in or removing missing values in the dataset.
6. **Discretization**: Turning a continuous feature into a discrete one by dividing its range of values into intervals.
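
A few of the techniques above can be sketched with pandas. This is only an illustration: the toy DataFrame and its column names (`income`, `debt`, `age`) are assumptions made for the example and do not come from the datasets used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this sketch.
df = pd.DataFrame({
    "income": [30.0, 55.0, np.nan, 120.0],
    "debt": [3.0, 11.0, 4.0, 6.0],
    "age": [24, 41, 35, 58],
})

# Missing value handling: fill the gap with the column median.
df["income"] = df["income"].fillna(df["income"].median())

# Feature creation: combine two existing features into a ratio.
df["debt_to_income"] = df["debt"] / df["income"]

# Discretization: split a continuous feature into labeled intervals.
# The bin edges are arbitrary choices for the example.
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 30, 50, 120],
    labels=["young", "middle", "senior"],
)

print(df)
```

Which of these steps helps depends on the data and the model, so each one is worth checking against a baseline.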
## Model Optimization
- Distance-based algorithms such as KNN (K-Nearest Neighbors) generally benefit from scaling, whereas a Decision Tree splits on one feature at a time and does not need it.
- Some algorithms are sensitive to irrelevant features, which makes feature selection important.
- Feature transformation can help certain algorithms converge faster, for example those trained with gradient-based optimizers.
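
To see why a distance-based model such as KNN reacts to feature scale, the sketch below compares nearest neighbors before and after min-max scaling. The four (age, income) points are made up for the illustration.

```python
import numpy as np

# Hypothetical points: column 0 is age, column 1 is income.
X = np.array([
    [25, 50_000],   # query point
    [60, 51_000],   # very different age, similar income
    [27, 54_000],   # similar age, somewhat different income
    [45, 90_000],
], dtype=float)


def nearest_to_first(data):
    """Return the row index of the point closest to row 0 (Euclidean distance)."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1


# Min-max scaling per column, so both features span [0, 1].
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

nearest_raw = nearest_to_first(X)
nearest_scaled = nearest_to_first(X_scaled)

print("Nearest neighbor on raw data:   row", nearest_raw)
print("Nearest neighbor on scaled data: row", nearest_scaled)
```

On the raw data the income column dominates the distance, so the 60-year-old looks closest. After scaling, both features contribute on a comparable footing and the 27-year-old becomes the nearest neighbor.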
## Conclusion
Feature engineering is a crucial step in developing a machine learning model, and it can significantly affect the model's performance and effectiveness.

## Scaling Methods
Scaling is a technique for changing the scale of a feature so that it fits a particular range. Commonly used scaling methods include:
1. **Min-Max Scaling**: Rescales feature values into the range [0, 1].
   - Formula:
     \[ X' = \frac{X - X_{min}}{X_{max} - X_{min}} \]
2. **Standardization (Z-score Normalization)**: Transforms the feature so that it has a mean of 0 and a standard deviation of 1.
   - Formula:
     \[ X' = \frac{X - \mu}{\sigma} \]
3. **Robust Scaling**: Uses the median and the interquartile range to reduce the influence of outliers.
   - Formula:
     \[ X' = \frac{X - \text{median}}{IQR} \]
4. **MaxAbs Scaling**: Rescales feature values into the range [-1, 1] by dividing by the maximum absolute value.
   - Formula:
     \[ X' = \frac{X}{\max(|X|)} \]
The right scaling method depends on the characteristics of the data and the machine learning algorithm being used.
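
The four formulas above can be checked by hand with NumPy. This is a minimal sketch on a made-up one-dimensional feature containing one outlier. It uses the population standard deviation and the 25th/75th percentiles, which, as far as I know, matches the defaults of scikit-learn's `StandardScaler` and `RobustScaler`.

```python
import numpy as np

# Made-up feature with one outlier (100).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: (x - min) / (max - min)
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, using the population std (ddof=0)
standard = (x - x.mean()) / x.std()

# Robust scaling: (x - median) / IQR
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

# MaxAbs scaling: x / max(|x|)
max_abs = x / np.abs(x).max()

for name, values in [("min-max", min_max), ("standard", standard),
                     ("robust", robust), ("max-abs", max_abs)]:
    print(f"{name:>8}: {np.round(values, 3)}")
```

The outlier squeezes the first four min-max values close to 0, while robust scaling keeps them evenly spread around the median.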

In [68]:
! pip install category_encoders



## **1. Scaling**
---
- min-max scaling
- standardization
- robust scaling

In [69]:
# library
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

import warnings
warnings.filterwarnings('ignore')

In [70]:
# Data
bank = pd.read_csv('bankloan.csv')
bank.head()

Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
0,41,3,17,12,176,9.3,11.359392,5.008608,1
1,27,1,10,6,31,17.3,1.362202,4.000798,0
2,40,1,15,14,55,5.5,0.856075,2.168925,0
3,41,1,15,14,120,2.9,2.65872,0.82128,0
4,24,2,2,0,28,17.3,1.787436,3.056564,1


In [71]:
# Split data

x = bank[['age', 'ed', 'employ', 'income', 'debtinc', 'creddebt', 'othdebt']]
y = bank['default']

xtrain, xtest, ytrain, ytest = train_test_split(
    x, 
    y, 
    test_size=0.2, 
    random_state=2025,
    stratify=y
)

### A. Scaling (In More Depth)

In [72]:
# Define the scalers

minmax = MinMaxScaler()
stdscal = StandardScaler()
robust = RobustScaler()


In [73]:
# minmax scaler
xtrainMinMax = minmax.fit_transform(xtrain)
xtestMinMax = minmax.transform(xtest)
pd.DataFrame(xtrainMinMax, columns=xtrain.columns).head()

Unnamed: 0,age,ed,employ,income,debtinc,creddebt,othdebt
0,0.371429,0.25,0.290323,0.030093,0.035912,0.013793,0.004383
1,0.371429,0.75,0.225806,0.060185,0.165746,0.045649,0.057976
2,0.4,0.0,0.322581,0.05787,0.146409,0.010032,0.072609
3,0.114286,0.0,0.096774,0.00463,0.480663,0.048631,0.066377
4,0.514286,0.0,0.0,0.05787,0.207182,0.051307,0.072973


In [74]:
# KNN model without scaling

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(xtrain, ytrain)

pred_knn = knn.predict(xtest)
acc_knn = accuracy_score(ytest, pred_knn)
print(f'Accuracy KNN without scaling: {acc_knn*100:.2f}%')

Accuracy KNN without scaling: 75.71%


In [75]:
# KNN model with MinMaxScaler
knn_minmax = KNeighborsClassifier(n_neighbors=7)
knn_minmax.fit(xtrainMinMax, ytrain)

# xtestMinMax was already transformed above with the scaler fitted on the training set
pred_knn_minmax = knn_minmax.predict(xtestMinMax)
acc_knn_minmax = accuracy_score(ytest, pred_knn_minmax)
print(f'Accuracy KNN with MinMaxScaler: {acc_knn_minmax*100:.2f}%')

Accuracy KNN with MinMaxScaler: 74.29%


In [76]:
# Standard Scaler
xtrainStd = stdscal.fit_transform(xtrain)
xtestStd = stdscal.transform(xtest)
pd.DataFrame(xtrainStd, columns=xtrain.columns).head()

Unnamed: 0,age,ed,employ,income,debtinc,creddebt,othdebt
0,-0.088262,0.294667,0.120121,-0.498444,-1.266817,-0.594351,-0.885581
1,-0.088262,2.423871,-0.187741,-0.152692,-0.562137,-0.28322,-0.439537
2,0.037827,-0.769936,0.274052,-0.179288,-0.667089,-0.631082,-0.317753
3,-1.223063,-0.769936,-0.803464,-0.791004,1.147085,-0.254094,-0.369619
4,0.542183,-0.769936,-1.265256,-0.179288,-0.337239,-0.22796,-0.314722


In [77]:
# Robust scaler
xtrainRobust = robust.fit_transform(xtrain)
xtestrobust = robust.transform(xtest)
pd.DataFrame(xtrainRobust, columns=xtrain.columns).head()


Unnamed: 0,age,ed,employ,income,debtinc,creddebt,othdebt
0,0.0,1.0,0.222222,-0.208,-0.807122,-0.35928,-0.612241
1,0.0,3.0,0.0,0.208,-0.249258,0.068511,-0.114378
2,0.083333,0.0,0.333333,0.176,-0.332344,-0.409783,0.021554
3,-0.75,0.0,-0.444444,-0.56,1.103858,0.108557,-0.036338
4,0.416667,0.0,-0.777778,0.176,-0.071217,0.14449,0.024937


In [78]:
# KNN model with StandardScaler
knn_std = KNeighborsClassifier(n_neighbors=7)
knn_std.fit(xtrainStd, ytrain)
pred_knn_std = knn_std.predict(xtestStd)
acc_knn_std = accuracy_score(ytest, pred_knn_std)
print(f'Accuracy KNN with StandardScaler: {acc_knn_std*100:.2f}%')

Accuracy KNN with StandardScaler: 77.14%


In [79]:
# KNN model with RobustScaler
knn_robust = KNeighborsClassifier(n_neighbors=7)
knn_robust.fit(xtrainRobust, ytrain)
pred_knn_robust = knn_robust.predict(xtestrobust)
acc_knn_robust = accuracy_score(ytest, pred_knn_robust)
print(f'Accuracy KNN with RobustScaler: {acc_knn_robust*100:.2f}%')

Accuracy KNN with RobustScaler: 76.43%


In [80]:
# Summary of KNN performance with and without scaling
results = pd.DataFrame({
    'Model': ['KNN without scaling', 'KNN with MinMaxScaler', 'KNN with StandardScaler', 'KNN with RobustScaler'],
    'Accuracy': [acc_knn, acc_knn_minmax, acc_knn_std, acc_knn_robust]
})
# convert to percent and round to 2 decimals
results['Accuracy'] = (results['Accuracy'] * 100).round(2).astype(str) + '%'
print("\nSummary of KNN Model Performance:")
results


Summary of KNN Model Performance:


Unnamed: 0,Model,Accuracy
0,KNN without scaling,75.71%
1,KNN with MinMaxScaler,74.29%
2,KNN with StandardScaler,77.14%
3,KNN with RobustScaler,76.43%


In [81]:
# The issue that arises: new data has to be scaled by hand before prediction
dataBaru = xtrain.sample(2)
dataBaru

Unnamed: 0,age,ed,employ,income,debtinc,creddebt,othdebt
218,27,2,7,30,4.0,0.4476,0.7524
692,53,1,0,27,28.9,2.754459,5.048541


In [82]:
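# Pitfall: `knn` was fitted on unscaled features, yet it receives scaled input here.
# Scaled data belongs with the model trained on it (`knn_std`); mix-ups like this are easy to make by hand.
knn.predict(stdscal.transform(dataBaru))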

array([0, 0])

### B. Best Practice Scaling

In [83]:
# Define the preprocessing step

prepros = ColumnTransformer(
    [('Standard Scaler', StandardScaler(), ['age', 'ed', 'employ', 'income'])],
    remainder='passthrough'  # debtinc, creddebt, and othdebt pass through unscaled
)

prepros

In [84]:
# Model

pipelineKNN = Pipeline([
    ('preprocessing', prepros),
    ('model', knn)
])

pipelineKNN.fit(xtrain, ytrain)

In [85]:
# evaluate the model

pred = pipelineKNN.predict(xtest)
acc = accuracy_score(ytest, pred)
print(pred)
print(f'Accuracy KNN with Pipeline: {acc*100:.2f}%')

[0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0
 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 1 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 0]
Accuracy KNN with Pipeline: 75.00%


In [86]:
# predict on new data
pipelineKNN.predict(dataBaru)

array([0, 1])
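
The point of the pipeline is that the fitted scaler travels with the model, so new data is transformed the same way without any manual step. The sketch below checks this on synthetic data from `make_classification` (an assumption for the example, not the bank loan data): scaling by hand and scaling inside a `Pipeline` should give the same predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the first feature is inflated to mimic mixed scales.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X[:, 0] *= 1000
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Manual route: fit the scaler on the training set, then remember to
# apply the same fitted scaler to anything passed to predict().
scaler = StandardScaler()
knn_manual = KNeighborsClassifier(n_neighbors=7)
knn_manual.fit(scaler.fit_transform(X_train), y_train)
pred_manual = knn_manual.predict(scaler.transform(X_test))

# Pipeline route: scaling and model are fitted and applied together.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier(n_neighbors=7)),
])
pipe.fit(X_train, y_train)
pred_pipe = pipe.predict(X_test)

same = np.array_equal(pred_manual, pred_pipe)
print("Identical predictions:", same)
```

Both routes do the same arithmetic. The pipeline simply removes the chance of pairing a model with the wrong scaler or forgetting the transform.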

**Challenge**
1. Use the white_wine.csv dataset.
2. Create a new column to serve as the target, with quality > 6 labeled "Good" and everything else labeled "Not Good". The model should focus on predicting the "Not Good" class.
3. Show whether scaling affects the logistic regression and SVM (RBF kernel) models.
4. Show whether scaling affects the Decision Tree model.

Summarize the results in a DataFrame.

In [87]:
# import additional libraries
from sklearn.svm import SVC

# Load data
wine = pd.read_csv('white_wine.csv')
wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6.0
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6.0
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6.0
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6.0
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6.0


In [88]:
# Create the target column
wine['category'] = wine['quality'].apply(lambda x: 'Good' if x > 6 else 'Not Good')
wine['category'].value_counts()

category
Not Good    422
Good         98
Name: count, dtype: int64

In [89]:
# Check for missing values
print("Missing values:")
print(wine.isnull().sum())
print("\nTotal missing values:", wine.isnull().sum().sum())

Missing values:
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      1
sulphates               1
alcohol                 1
quality                 1
category                0
dtype: int64

Total missing values: 4


In [90]:
# Drop rows with missing values
wine = wine.dropna()
print(f"Number of rows after dropping missing values: {len(wine)}")

Number of rows after dropping missing values: 519


In [91]:
# Split data
x = wine.drop(['quality', 'category'], axis=1)
y = wine['category']

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=2025, stratify=y)

In [92]:
# Scaling data
scaler = StandardScaler()
xtrainScaled = scaler.fit_transform(xtrain)
xtestScaled = scaler.transform(xtest)

In [93]:
# Logistic Regression without scaling
logreg = LogisticRegression()
logreg.fit(xtrain, ytrain)
pred_logreg = logreg.predict(xtest)
acc_logreg = accuracy_score(ytest, pred_logreg)
print(f'Accuracy Logistic Regression without scaling: {acc_logreg*100:.2f}%')

Accuracy Logistic Regression without scaling: 89.42%


In [94]:
# Logistic Regression with scaling
logreg_scaled = LogisticRegression()
logreg_scaled.fit(xtrainScaled, ytrain)
pred_logreg_scaled = logreg_scaled.predict(xtestScaled)
acc_logreg_scaled = accuracy_score(ytest, pred_logreg_scaled)
print(f'Accuracy Logistic Regression with scaling: {acc_logreg_scaled*100:.2f}%')

Accuracy Logistic Regression with scaling: 100.00%


In [95]:
# SVM without scaling
svm = SVC(kernel='rbf')
svm.fit(xtrain, ytrain)
pred_svm = svm.predict(xtest)
acc_svm = accuracy_score(ytest, pred_svm)
print(f'Accuracy SVM without scaling: {acc_svm*100:.2f}%')

Accuracy SVM without scaling: 80.77%


In [96]:
# SVM with scaling
svm_scaled = SVC(kernel='rbf')
svm_scaled.fit(xtrainScaled, ytrain)
pred_svm_scaled = svm_scaled.predict(xtestScaled)
acc_svm_scaled = accuracy_score(ytest, pred_svm_scaled)
print(f'Accuracy SVM with scaling: {acc_svm_scaled*100:.2f}%')

Accuracy SVM with scaling: 100.00%


In [97]:
# Decision Tree without scaling
dt = DecisionTreeClassifier()
dt.fit(xtrain, ytrain)
pred_dt = dt.predict(xtest)
acc_dt = accuracy_score(ytest, pred_dt)
print(f'Accuracy Decision Tree without scaling: {acc_dt*100:.2f}%')

Accuracy Decision Tree without scaling: 99.04%


In [98]:
# Decision Tree with scaling
# No random_state is set, so a small gap from the unscaled run can come from
# random tie-breaking between equally good splits rather than from scaling.
dt_scaled = DecisionTreeClassifier()
dt_scaled.fit(xtrainScaled, ytrain)
pred_dt_scaled = dt_scaled.predict(xtestScaled)
acc_dt_scaled = accuracy_score(ytest, pred_dt_scaled)
print(f'Accuracy Decision Tree with scaling: {acc_dt_scaled*100:.2f}%')

Accuracy Decision Tree with scaling: 98.08%


In [99]:
# Results
hasil = pd.DataFrame({
    'Model': ['Logistic Regression without scaling', 
              'Logistic Regression with scaling',
              'SVM without scaling',
              'SVM with scaling',
              'Decision Tree without scaling',
              'Decision Tree with scaling'],
    'Accuracy': [acc_logreg, acc_logreg_scaled, acc_svm, acc_svm_scaled, acc_dt, acc_dt_scaled]
})

hasil['Accuracy'] = (hasil['Accuracy'] * 100).round(2).astype(str) + '%'
print("\nSummary of Results:")
hasil


Summary of Results:


Unnamed: 0,Model,Accuracy
0,Logistic Regression without scaling,89.42%
1,Logistic Regression with scaling,100.0%
2,SVM without scaling,80.77%
3,SVM with scaling,100.0%
4,Decision Tree without scaling,99.04%
5,Decision Tree with scaling,98.08%
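
The six cells above repeat the same fit-and-score pattern, which could also be written as a loop over pipelines. The sketch below is one possible arrangement, not the notebook's own method. It uses scikit-learn's built-in `load_wine` dataset as a stand-in because `white_wine.csv` is not available here, so its numbers are not comparable with the table above. `max_iter=5000` and `random_state=0` are choices made for the example.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (3 classes), not white_wine.csv.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Factories so that every run starts from a fresh, unfitted model.
models = {
    "Logistic Regression": lambda: LogisticRegression(max_iter=5000),
    "SVM (rbf)": lambda: SVC(kernel="rbf"),
    "Decision Tree": lambda: DecisionTreeClassifier(random_state=0),
}

rows = []
for name, make_model in models.items():
    for scaled in (False, True):
        # Add the scaler step only for the "scaled" variant.
        steps = [("scaler", StandardScaler())] if scaled else []
        steps.append(("model", make_model()))
        pipe = Pipeline(steps).fit(X_train, y_train)
        acc = accuracy_score(y_test, pipe.predict(X_test))
        rows.append({"Model": name, "Scaled": scaled, "Accuracy": acc})

summary = pd.DataFrame(rows)
print(summary)
```

Fixing `random_state` for the Decision Tree makes the scaled and unscaled runs easier to compare, because differences no longer come from random tie-breaking.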


## Encoding Methods
Encoding is the process of converting categorical data into a numeric format that machine learning algorithms can use. Commonly used encoding methods include:
1. **Label Encoding**: Maps each category to a unique integer.
2. **One-Hot Encoding**: Creates one binary column per category.
3. **Binary Encoding**: Maps each category to an integer and spreads that integer's binary digits across several columns.
4. **Frequency Encoding**: Replaces each category with how often it occurs.
5. **Target Encoding**: Replaces each category with the mean of the target variable for that category.
The right encoding method depends on the characteristics of the data and the machine learning algorithm being used.
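
The five methods above can be sketched with plain pandas on a tiny made-up column. This is a hand-rolled illustration, not how `category_encoders` or scikit-learn implement them. The `day` and `target` values are invented, and the binary encoding shown is one simple variant (integer codes starting at 0).

```python
import pandas as pd

# Made-up data for the illustration.
df = pd.DataFrame({
    "day": ["Sun", "Sat", "Sun", "Fri", "Sat", "Sun"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Label encoding: one integer per category (alphabetical order here).
codes = df["day"].astype("category").cat.codes
df["day_label"] = codes

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["day"], prefix="day", dtype=int)

# Binary encoding: write each integer code in base 2, one column per bit.
n_bits = max(1, int(codes.max()).bit_length())
for bit in range(n_bits):
    df[f"day_bin_{bit}"] = (codes // 2**bit) % 2

# Frequency encoding: share of rows in which each category appears.
df["day_freq"] = df["day"].map(df["day"].value_counts(normalize=True))

# Target encoding: mean of the target per category.
# In practice this should be computed on the training split only,
# otherwise target information leaks into the features.
df["day_target"] = df["day"].map(df.groupby("day")["target"].mean())

print(pd.concat([df, one_hot], axis=1))
```

Label encoding implies an order between categories, which suits tree-based models better than distance-based or linear ones. One-hot encoding avoids that implied order at the cost of more columns.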

In [107]:
# library
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

import warnings
warnings.filterwarnings('ignore')

In [108]:
# tips dataset
tips = sns.load_dataset('tips')
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [109]:
# categorize tips: 0 if tip < 4, otherwise 1

tips['tip_category'] = np.where(tips['tip'] < 4, 0 , 1)
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_category
0,16.99,1.01,Female,No,Sun,Dinner,2,0
1,10.34,1.66,Male,No,Sun,Dinner,3,0
2,21.01,3.5,Male,No,Sun,Dinner,3,0
3,23.68,3.31,Male,No,Sun,Dinner,2,0
4,24.59,3.61,Female,No,Sun,Dinner,4,0


In [114]:
# define the preprocessing step
x = tips.drop(columns=['tip', 'tip_category'])
y = tips['tip_category']

preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(drop='first'), ['sex', 'smoker', 'day', 'time'])
    ], remainder='passthrough'
)
preprocessor.fit_transform(x)


array([[ 0.  ,  0.  ,  0.  , ...,  0.  , 16.99,  2.  ],
       [ 1.  ,  0.  ,  0.  , ...,  0.  , 10.34,  3.  ],
       [ 1.  ,  0.  ,  0.  , ...,  0.  , 21.01,  3.  ],
       ...,
       [ 1.  ,  1.  ,  1.  , ...,  0.  , 22.67,  2.  ],
       [ 1.  ,  0.  ,  1.  , ...,  0.  , 17.82,  2.  ],
       [ 0.  ,  0.  ,  0.  , ...,  0.  , 18.78,  2.  ]])
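
The effect of `drop='first'` is easier to see on a single small column. In this sketch the four-row `day` frame is made up for the example. The encoder sorts the categories, drops the first one, and represents it as a row of zeros.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up column with four categories.
days = pd.DataFrame({"day": ["Thur", "Fri", "Sat", "Sun"]})

encoder = OneHotEncoder(drop="first")
# The encoder returns a sparse matrix by default; convert it for display.
encoded = encoder.fit_transform(days).toarray()

# categories_ holds the sorted categories; the first one is the dropped baseline.
categories = list(encoder.categories_[0])
print("Categories:", categories)
print("Dropped baseline:", categories[0])
print(encoded)
```

Dropping one level avoids a redundant column, since the dropped category is implied when all remaining indicator columns are 0.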

In [115]:
# convert the categorical columns to numeric with OneHotEncoder
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(drop='first'), ['sex', 'smoker', 'day', 'time'])
    ], remainder='passthrough'
)   

In [116]:
# split data
x = tips.drop(columns=['tip', 'tip_category'])
y = tips['tip_category']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2025, stratify=y)

In [120]:
knn = KNeighborsClassifier(n_neighbors=5)
pipelineKNN = Pipeline([
    ('preprocessing', preprocessor),
    ('model', knn)
])
pipelineKNN.fit(x_train, y_train)

# evaluate the model
y_pred = pipelineKNN.predict(x_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}%")

Accuracy: 77.55%
