Polynomial features are commonly used in regression so that a linear model can fit data whose relationship with the target is not a straight line: the original features are expanded with their powers and interaction terms, and the model stays linear in the expanded features.
* if the polynomial degree is too low, the model will underfit
* if the polynomial degree is too high, the model will overfit
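
The effect of the degree can be seen on a toy problem. The sketch below is only an illustration on made-up data (a cubic trend plus noise, fitted with `np.polyfit`); it is not part of the wine analysis, and the helper names `truth` and `mse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a cubic trend plus Gaussian noise
x_train = np.linspace(-1, 1, 20)
x_test = np.linspace(-0.95, 0.95, 20)

def truth(x):
    return x**3 - 0.5 * x

y_train = truth(x_train) + rng.normal(scale=0.1, size=x_train.size)
y_test = truth(x_test) + rng.normal(scale=0.1, size=x_test.size)

def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_test, y_test))
    print(f"degree={degree}: train MSE={results[degree][0]:.4f}, "
          f"test MSE={results[degree][1]:.4f}")
```

The training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. A degree that is too low leaves a large error on both sets (underfitting), while a degree that is too high tends to chase the noise, which shows up as a gap between training and test error (overfitting).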

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Preprocessing
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer  # applies transformers (fit / transform) per column
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Model
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score

# Utilities
import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

In [4]:
df = pd.read_csv('4.white_wine.csv')
df

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.0010,3.00,0.45,8.8,6.0
1,6.3,0.30,0.34,1.6,0.049,14.0,132.0,0.9940,3.30,0.49,9.5,6.0
2,8.1,0.28,0.40,6.9,0.050,30.0,97.0,0.9951,3.26,0.44,10.1,6.0
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.40,9.9,6.0
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.40,9.9,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...
515,6.1,0.31,0.26,2.2,0.051,28.0,167.0,0.9926,3.37,0.47,10.4,6.0
516,6.8,0.18,0.37,1.6,0.055,47.0,154.0,0.9934,3.08,0.45,9.1,5.0
517,7.4,0.15,0.42,1.7,0.045,49.0,154.0,0.9920,3.00,0.60,10.4,6.0
518,5.9,0.13,0.28,1.9,0.050,20.0,78.0,0.9918,3.43,0.64,10.8,6.0


Features to be used: density & alcohol
Wine classification:
* quality > 6 = Good Wine
* quality <= 6 = Bad Wine

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520 entries, 0 to 519
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         520 non-null    float64
 1   volatile acidity      520 non-null    float64
 2   citric acid           520 non-null    float64
 3   residual sugar        520 non-null    float64
 4   chlorides             520 non-null    float64
 5   free sulfur dioxide   520 non-null    float64
 6   total sulfur dioxide  520 non-null    float64
 7   density               520 non-null    float64
 8   pH                    519 non-null    float64
 9   sulphates             519 non-null    float64
 10  alcohol               519 non-null    float64
 11  quality               519 non-null    float64
dtypes: float64(12)
memory usage: 48.9 KB


In [6]:
df.isna().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      1
sulphates               1
alcohol                 1
quality                 1
dtype: int64

There are missing values in columns that will be used: `alcohol` (a feature) and `quality` (the target) each have one.

In [8]:
# fill the missing alcohol value with the column mean (assignment instead of chained inplace)
df['alcohol'] = df['alcohol'].fillna(df['alcohol'].mean())
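
Filling with the mean of the whole column uses rows that later end up in the test set. One alternative, sketched here under the assumption that the imputer should learn its statistics from the training split only, is `SimpleImputer` (already imported above). The two small frames are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature train/test frames with a missing alcohol value in each
train = pd.DataFrame({"density": [0.99, 1.00, 0.98],
                      "alcohol": [9.0, np.nan, 11.0]})
test = pd.DataFrame({"density": [0.99],
                     "alcohol": [np.nan]})

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train)  # learns the column means from train
test_filled = imputer.transform(test)        # reuses the train means on test

print(train_filled)
print(test_filled)
```

The missing test value is filled with the training mean of `alcohol`, so no information from the test rows enters the preprocessing step.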

In [14]:
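# 1 = Good Wine (quality > 6), 0 = Bad Wine; a missing quality compares as False and is labeled 0
df['label'] = np.where(df['quality'] > 6.0, 1, 0)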
df['label'].value_counts()

0    422
1     98
Name: label, dtype: int64

Good wine is much less common than bad wine (98 vs. 422 rows), so the classes are imbalanced.
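
With this imbalance it helps to know what accuracy a trivial model would reach. The short calculation below uses the counts printed above; the variable names are only for illustration.

```python
# Class counts taken from the value_counts() output above
counts = {0: 422, 1: 98}

total = sum(counts.values())
majority_share = max(counts.values()) / total

# Accuracy of a model that always predicts the majority class (Bad Wine)
print(f"{majority_share:.4f}")  # 0.8115
```

Any accuracy reported later should be read against this roughly 81% baseline rather than against 50%.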

## Data Splitting

In [16]:
X = df[['density','alcohol']]
y = df['label']

In [17]:
X_train,X_test,y_train,y_test = train_test_split(X,y,
                                                stratify=y,
                                                random_state=2020)
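
The `stratify=y` argument keeps the share of Good Wine roughly equal in both splits. A minimal sketch with synthetic labels that mimic the class counts above (the `_demo` arrays are made up, not the wine data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the notebook (422 vs. 98)
y_demo = np.array([0] * 422 + [1] * 98)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=2020
)

print(len(y_tr), len(y_te))                      # default test_size is 0.25
print(y_demo.mean(), y_tr.mean(), y_te.mean())   # share of class 1 in each set
```

Without stratification a small minority class can end up over- or under-represented in the test set by chance, which makes the reported accuracy less stable.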

## Modelling without polynomial features

In [18]:
logreg = LogisticRegression()
logreg.fit(X_train,y_train)

LogisticRegression()

In [19]:
y_pred = logreg.predict(X_test)
accuracy_score(y_test,y_pred)

0.8538461538461538

The accuracy of the model without polynomial features is 85.38%.

## Modelling with polynomial features

In [51]:
poly = PolynomialFeatures(degree=3, interaction_only=False, include_bias=False)  # instantiate the transformer first

In [52]:
X_train_poly = poly.fit_transform(X_train)
X_test_poly  = poly.transform(X_test)  # reuse the transformer fitted on the training data
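
For `PolynomialFeatures`, fitting essentially only records the number of input columns, so fitting on the test set would not change the numbers here. For transformers that learn statistics from the data, it does. A small sketch with `StandardScaler` and made-up values shows the difference:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[0.0], [10.0]])
test = np.array([[10.0], [20.0]])

scaler = StandardScaler().fit(train)        # mean and std learned from train
test_correct = scaler.transform(test)       # scaled with the train statistics

test_refit = StandardScaler().fit_transform(test)  # statistics taken from test itself

print(test_correct.ravel())
print(test_refit.ravel())
```

The two results differ, so the habit of calling `fit_transform` on training data and `transform` on test data is worth keeping for every transformer.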

In [53]:
pd.DataFrame(X_train_poly)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0.9936,9.1,0.987241,9.04176,82.81,0.980923,8.983893,82.280016,753.571
1,0.9998,8.7,0.999600,8.69826,75.69,0.999400,8.696520,75.674862,658.503
2,0.9934,9.7,0.986844,9.63598,94.09,0.980330,9.572383,93.469006,912.673
3,0.9910,12.6,0.982081,12.48660,158.76,0.973242,12.374221,157.331160,2000.376
4,0.9931,10.6,0.986248,10.52686,112.36,0.979443,10.454225,111.584716,1191.016
...,...,...,...,...,...,...,...,...,...
385,0.9927,9.4,0.985453,9.33138,88.36,0.978259,9.263261,87.714972,830.584
386,0.9955,10.4,0.991020,10.35320,108.16,0.986561,10.306611,107.673280,1124.864
387,0.9949,9.0,0.989826,8.95410,81.00,0.984778,8.908434,80.586900,729.000
388,0.9974,10.5,0.994807,10.47270,110.25,0.992220,10.445471,109.963350,1157.625


Columns 0 and 1 are the original features (density and alcohol); the remaining columns are the generated polynomial terms.
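
To see what each generated column contains, the same transformer settings can be applied to a single made-up row `[2, 3]`, where the products are easy to check by hand:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up row: x0 = 2, x1 = 3
demo = np.array([[2.0, 3.0]])

poly_demo = PolynomialFeatures(degree=3, interaction_only=False, include_bias=False)
expanded = poly_demo.fit_transform(demo)

# Column order: x0, x1, x0^2, x0*x1, x1^2, x0^3, x0^2*x1, x0*x1^2, x1^3
print(expanded)
```

The first two values are the inputs themselves, followed by the degree-2 terms (4, 6, 9) and the degree-3 terms (8, 12, 18, 27).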

In [54]:
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
poly.get_feature_names()

['x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2', 'x1^3']
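
The number of generated columns follows from counting monomials: with `n` input features and degree `d` there are C(n + d, d) terms including the constant, so one fewer when `include_bias=False`. A quick check for the setting used here (the variable names are illustrative):

```python
from math import comb

n_features = 2   # density, alcohol
degree = 3

n_with_bias = comb(n_features + degree, degree)  # includes the constant term
n_terms = n_with_bias - 1                        # include_bias=False drops it

print(n_terms)  # 9
```

This count grows quickly with both the degree and the number of features, which is one reason high degrees tend to overfit.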

In [55]:
pd.DataFrame(X_train_poly, columns=poly.get_feature_names())

Unnamed: 0,x0,x1,x0^2,x0 x1,x1^2,x0^3,x0^2 x1,x0 x1^2,x1^3
0,0.9936,9.1,0.987241,9.04176,82.81,0.980923,8.983893,82.280016,753.571
1,0.9998,8.7,0.999600,8.69826,75.69,0.999400,8.696520,75.674862,658.503
2,0.9934,9.7,0.986844,9.63598,94.09,0.980330,9.572383,93.469006,912.673
3,0.9910,12.6,0.982081,12.48660,158.76,0.973242,12.374221,157.331160,2000.376
4,0.9931,10.6,0.986248,10.52686,112.36,0.979443,10.454225,111.584716,1191.016
...,...,...,...,...,...,...,...,...,...
385,0.9927,9.4,0.985453,9.33138,88.36,0.978259,9.263261,87.714972,830.584
386,0.9955,10.4,0.991020,10.35320,108.16,0.986561,10.306611,107.673280,1124.864
387,0.9949,9.0,0.989826,8.95410,81.00,0.984778,8.908434,80.586900,729.000
388,0.9974,10.5,0.994807,10.47270,110.25,0.992220,10.445471,109.963350,1157.625


In [56]:
lr = LogisticRegression()
lr.fit(X_train_poly,y_train)

LogisticRegression()

In [45]:
y_pred = lr.predict(X_test_poly)
accuracy_score(y_test,y_pred)

0.9692307692307692

With the polynomial features, model accuracy rises to 96.92%.
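
The same comparison can be written more compactly with a `Pipeline`, which applies `fit` to the training data and `transform` to the test data automatically. The sketch below is an assumption-laden illustration, not the notebook's method: it uses synthetic data from `make_circles` instead of the wine data, degree 2 instead of 3, and adds a `StandardScaler` step that the notebook does not use.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic, non-linearly separable data (two concentric circles)
X_demo, y_demo = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=2020
)

# Baseline: logistic regression on the raw features
linear_model = LogisticRegression().fit(X_tr, y_tr)

# Polynomial expansion, scaling and the classifier chained in one estimator
poly_model = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression()),
]).fit(X_tr, y_tr)

lin_acc = linear_model.score(X_te, y_te)
poly_acc = poly_model.score(X_te, y_te)
print(f"linear: {lin_acc:.3f}, polynomial: {poly_acc:.3f}")
```

On this toy data a straight decision boundary cannot separate the two circles, while the squared terms make them separable, which mirrors the kind of gain reported above.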