<a href="https://colab.research.google.com/github/harishmuh/machine_learning_practices/blob/main/Polynomial_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

# **Data Preprocessing: Polynomial Features**

---

## **What Are Polynomial Features?**
Polynomial features are new features created by raising existing features to a power, and optionally multiplying them together. This technique is commonly used to let linear models capture non-linear relationships.

**Common Use Case**

Polynomial features are often used:

* With linear regression, to model non-linear patterns

* In feature engineering, to enrich the dataset for better model performance



## **How polynomial features help handle underfitting**


Underfitting happens when a model is too simple to capture the underlying pattern in the data. It performs poorly on both the training and the test data. A typical sign is low accuracy or high error even on the training set.

When we use polynomial features, we add complexity to the model, which enables it to learn non-linear relationships.

Instead of fitting only a straight line such as

$$y = ax + b$$

we can fit a polynomial:

$$y = a_1 x + a_2 x^2 + a_3 x^3 + \dots + b$$

This allows the model to bend and curve to better follow the true trend in the data.

So, polynomial features add flexibility to the model and help it fit non-linear patterns, which reduces underfitting.
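
To make the underfitting argument concrete, here is a small sketch on synthetic, noise-free data that follows y = x^2. The data and all variable names are assumptions chosen for illustration. It compares a plain linear regression with the same model fitted on degree-2 polynomial features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data following y = x^2 (an assumption for illustration only)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y_curve = (x ** 2).ravel()

# A straight line cannot follow the curve, so it underfits
linear_fit = LinearRegression().fit(x, y_curve)
r2_linear = linear_fit.score(x, y_curve)

# The same linear model on degree-2 features can represent the curve
poly_fit = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y_curve)
r2_poly = poly_fit.score(x, y_curve)

print(f"R^2 straight line: {r2_linear:.3f}")
print(f"R^2 degree 2:      {r2_poly:.3f}")
```

The straight line scores poorly even on the data it was trained on, which is the typical sign of underfitting, while the degree-2 model fits the curve almost exactly.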

# **Case study**

In [1]:
# Importing basic library
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Importing libraries for machine learning modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Importing libraries for preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Importing library for polynomial features
from sklearn.preprocessing import PolynomialFeatures

# Suppress all warnings (note: this also hides convergence warnings from LogisticRegression)
import warnings
warnings.filterwarnings('ignore')

In [2]:
# loading dataset
url = 'https://raw.githubusercontent.com/harishmuh/machine_learning_practices/refs/heads/main/datasets/white_wine.csv'
df = pd.read_csv(url)[['density', 'alcohol', 'quality']]
df.head()

| | density | alcohol | quality |
|---|---|---|---|
| 0 | 1.001 | 8.8 | 6.0 |
| 1 | 0.994 | 9.5 | 6.0 |
| 2 | 0.9951 | 10.1 | 6.0 |
| 3 | 0.9956 | 9.9 | 6.0 |
| 4 | 0.9956 | 9.9 | 6.0 |


In [3]:
# Setting the target class
# quality > 6  : class 1
# quality <= 6 : class 0

df['quality'] = np.where(df['quality']>6, 1, 0)
df.head()

| | density | alcohol | quality |
|---|---|---|---|
| 0 | 1.001 | 8.8 | 0 |
| 1 | 0.994 | 9.5 | 0 |
| 2 | 0.9951 | 10.1 | 0 |
| 3 | 0.9956 | 9.9 | 0 |
| 4 | 0.9956 | 9.9 | 0 |


In [4]:
# Missing values check
df.isna().sum()

| | 0 |
|---|---|
| density | 0 |
| alcohol | 1 |
| quality | 0 |


In [5]:
# Handling missing values
df = df.dropna()
df.shape

(519, 3)

In [6]:
# Handling duplicates
df = df.drop_duplicates()
df.shape

(368, 3)

In [7]:
# Define features (X) and target (y)
X = df.drop(columns='quality')
y = df['quality']

In [8]:
# Data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10, stratify=y)

## **a. Benchmark model (without polynomial)**

In [9]:
# define algo
model = LogisticRegression(random_state=10)

# fit
model.fit(X_train, y_train)

# predict
y_pred = model.predict(X_test)

accuracy_score(y_test, y_pred)

0.8513513513513513

## **b. Modeling with polynomial features**

In [10]:
# Define the polynomial transformer (degree 3)
poly = PolynomialFeatures(degree=3)

In [11]:
# Preprocessing
transformer = ColumnTransformer([
    ('poly', poly, ['alcohol', 'density'])
], remainder='passthrough')

In [12]:
# define algo
model = LogisticRegression(random_state=10)

# pipeline to connect polynomial and model
pipe_model = Pipeline([
    ('preprocessing', transformer),
    ('modeling', model)
])

# fit
pipe_model.fit(X_train, y_train)

# predict
y_pred_poly = pipe_model.predict(X_test)

accuracy_score(y_test, y_pred_poly)

0.9594594594594594

## **Tuning of polynomial degree**

We will try to increase the accuracy further by tuning the polynomial degree. A grid search with 5-fold cross-validation evaluates degrees 1 through 9 and keeps the degree with the highest mean accuracy.

In [13]:
# Importing library for tuning
from sklearn.model_selection import cross_val_score, GridSearchCV

In [14]:
# Estimator
pipe_model = Pipeline([
    ('preprocessing', transformer),
    ('modeling', model)
])

# Setting parameter of polynomial degree
hyperparam = {
    'preprocessing__poly__degree': [1,2,3,4,5,6,7,8,9]
}

# Define gridsearch
gridsearch = GridSearchCV(
    estimator= pipe_model,
    param_grid= hyperparam,
    cv= 5,
    scoring= 'accuracy',
    n_jobs= -1
)
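
The key `preprocessing__poly__degree` in the grid above follows scikit-learn's nested-parameter naming: pipeline step name, then the transformer name inside the `ColumnTransformer`, then the parameter, joined by double underscores. The sketch below rebuilds an equivalent, unfitted pipeline under the hypothetical names `demo_transformer` and `demo_pipe` to show how that key can be discovered and set by hand.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same structure as the notebook's pipeline, under hypothetical demo names
demo_transformer = ColumnTransformer([
    ('poly', PolynomialFeatures(degree=3), ['alcohol', 'density'])
], remainder='passthrough')

demo_pipe = Pipeline([
    ('preprocessing', demo_transformer),
    ('modeling', LogisticRegression())
])

# Nested parameters are addressed as <step>__<transformer>__<parameter>
degree_keys = [key for key in demo_pipe.get_params() if key.endswith('__degree')]
print(degree_keys)

# set_params accepts the same key that the grid search varies per candidate
demo_pipe.set_params(preprocessing__poly__degree=4)
new_degree = demo_pipe.get_params()['preprocessing__poly__degree']
print(new_degree)
```

Listing `get_params()` keys this way is a quick check that a `param_grid` key is spelled correctly before launching a search.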


In [15]:
# Fitting
gridsearch.fit(X_train, y_train)

In [16]:
pd.DataFrame(gridsearch.cv_results_).sort_values('rank_test_score')

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_preprocessing__poly__degree | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.043281 | 0.013964 | 0.010809 | 0.003582 | 4 | {'preprocessing__poly__degree': 4} | 0.983051 | 0.898305 | 0.966102 | 0.983051 | 0.982759 | 0.962653 | 0.03283 | 1 |
| 4 | 0.047332 | 0.024191 | 0.007733 | 0.002131 | 5 | {'preprocessing__poly__degree': 5} | 0.983051 | 0.898305 | 0.949153 | 0.983051 | 0.982759 | 0.959264 | 0.033172 | 2 |
| 5 | 0.05416 | 0.01587 | 0.007622 | 0.002166 | 6 | {'preprocessing__poly__degree': 6} | 0.983051 | 0.898305 | 0.949153 | 0.983051 | 0.982759 | 0.959264 | 0.033172 | 2 |
| 8 | 0.062184 | 0.018569 | 0.010929 | 0.005568 | 9 | {'preprocessing__poly__degree': 9} | 0.983051 | 0.898305 | 0.949153 | 0.983051 | 0.982759 | 0.959264 | 0.033172 | 2 |
| 7 | 0.064402 | 0.024354 | 0.013757 | 0.005397 | 8 | {'preprocessing__poly__degree': 8} | 0.983051 | 0.898305 | 0.932203 | 0.983051 | 0.982759 | 0.955874 | 0.034855 | 5 |
| 6 | 0.069185 | 0.021666 | 0.010055 | 0.003222 | 7 | {'preprocessing__poly__degree': 7} | 0.983051 | 0.898305 | 0.932203 | 0.983051 | 0.982759 | 0.955874 | 0.034855 | 5 |
| 2 | 0.042251 | 0.014607 | 0.012601 | 0.007608 | 3 | {'preprocessing__poly__degree': 3} | 0.915254 | 0.898305 | 0.915254 | 0.966102 | 0.965517 | 0.932086 | 0.028222 | 7 |
| 0 | 0.026642 | 0.013502 | 0.010846 | 0.009449 | 1 | {'preprocessing__poly__degree': 1} | 0.79661 | 0.79661 | 0.79661 | 0.881356 | 0.862069 | 0.826651 | 0.037295 | 8 |
| 1 | 0.040876 | 0.017888 | 0.009612 | 0.004707 | 2 | {'preprocessing__poly__degree': 2} | 0.79661 | 0.79661 | 0.79661 | 0.881356 | 0.862069 | 0.826651 | 0.037295 | 8 |


In [17]:
# Best score
gridsearch.best_score_

np.float64(0.9626534190531852)

In [18]:
# Best parameters
gridsearch.best_params_

{'preprocessing__poly__degree': 4}

In [19]:
gridsearch.best_estimator_

## **Final Model**

We will use the best parameters found by the grid search for the final model.


In [20]:
# define polynomial
poly = PolynomialFeatures(degree=4)

transformer = ColumnTransformer([
    ('poly', poly, ['alcohol', 'density'])
], remainder='passthrough')

pipe_model = Pipeline([
    ('preprocessing', transformer),
    ('modeling', model)
])

pipe_model

In [21]:
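# Use the best pipeline found by the grid search (degree 4)
pipe_model = gridsearch.best_estimator_

# best_estimator_ was already refit on X_train (refit=True is the GridSearchCV default);
# fitting again on the same data is kept here for explicitness
pipe_model.fit(X_train, y_train)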

y_pred_degree4 = pipe_model.predict(X_test)

accuracy_score(y_test, y_pred_degree4)

0.9864864864864865
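
As a side note on `best_estimator_`: with the default `refit=True`, `GridSearchCV` refits the best pipeline on the full training data, so the search object can predict directly. The sketch below illustrates this on synthetic data from `make_moons`, which stands in for the wine features. The dataset, the degree grid, and all `demo_` names are assumptions for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic two-class data standing in for the wine features (assumption)
X_demo, y_demo = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

demo_pipe = Pipeline([
    ('poly', PolynomialFeatures()),
    ('modeling', LogisticRegression(max_iter=1000)),
])

demo_search = GridSearchCV(
    demo_pipe, {'poly__degree': [1, 2, 3]}, cv=5, scoring='accuracy'
)
demo_search.fit(X_tr, y_tr)

# With refit=True (the default), best_estimator_ is already fitted on all of X_tr,
# and the search object delegates predict() to it
direct_pred = demo_search.predict(X_te)
best_pred = demo_search.best_estimator_.predict(X_te)
same_predictions = (direct_pred == best_pred).all()
print(same_predictions)
```

Both calls go through the same fitted pipeline, so no extra `fit` is needed before predicting.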

In [22]:
print('Benchmark model (without polynomial features):', accuracy_score(y_test, y_pred))
print('Model 1. polynomial degree 3:', accuracy_score(y_test, y_pred_poly))
print('Model 2. polynomial degree 4:', accuracy_score(y_test, y_pred_degree4))

Benchmark model (without polynomial features): 0.8513513513513513
Model 1. polynomial degree 3: 0.9594594594594594
Model 2. polynomial degree 4: 0.9864864864864865


Model 2, which uses polynomial features of degree 4, achieves the highest test accuracy.

**Feature (after polynomial)**

This is just a quick look at what happens to the features. Feel free to change the polynomial degree to see how the output changes.

In [23]:
X.head()

| | density | alcohol |
|---|---|---|
| 0 | 1.001 | 8.8 |
| 1 | 0.994 | 9.5 |
| 2 | 0.9951 | 10.1 |
| 3 | 0.9956 | 9.9 |
| 6 | 0.9949 | 9.6 |


In [24]:
poly = PolynomialFeatures(degree=3)

X_poly = poly.fit_transform(X)
X_poly

array([[1.00000000e+00, 1.00100000e+00, 8.80000000e+00, ...,
        8.81760880e+00, 7.75174400e+01, 6.81472000e+02],
       [1.00000000e+00, 9.94000000e-01, 9.50000000e+00, ...,
        9.38634200e+00, 8.97085000e+01, 8.57375000e+02],
       [1.00000000e+00, 9.95100000e-01, 1.01000000e+01, ...,
        1.00012625e+01, 1.01510151e+02, 1.03030100e+03],
       ...,
       [1.00000000e+00, 1.00020000e+00, 1.03000000e+01, ...,
        1.03041204e+01, 1.06111218e+02, 1.09272700e+03],
       [1.00000000e+00, 9.92600000e-01, 1.04000000e+01, ...,
        1.02466495e+01, 1.07359616e+02, 1.12486400e+03],
       [1.00000000e+00, 9.91800000e-01, 1.08000000e+01, ...,
        1.06236062e+01, 1.15683552e+02, 1.25971200e+03]])

In [25]:
poly.get_feature_names_out()

array(['1', 'density', 'alcohol', 'density^2', 'density alcohol',
       'alcohol^2', 'density^3', 'density^2 alcohol', 'density alcohol^2',
       'alcohol^3'], dtype=object)
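
The ten names above agree with the general count for a full polynomial expansion: with n input features and degree d (bias column included), the number of output columns is the binomial coefficient C(n + d, d). The short check below compares that formula with what `PolynomialFeatures` reports. The placeholder zero array and the name `feature_counts` are only there for illustration.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 2  # density and alcohol

feature_counts = {}
for degree in (1, 2, 3, 4):
    expected = comb(n_features + degree, degree)
    # Fitting on a placeholder array is enough to learn the output width
    poly_check = PolynomialFeatures(degree=degree).fit(np.zeros((1, n_features)))
    feature_counts[degree] = (expected, poly_check.n_output_features_)
    print(degree, expected, poly_check.n_output_features_)
```

The count grows quickly with the degree and the number of input features, which is worth keeping in mind when tuning the degree on wider datasets.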

In [26]:
X_poly_df = pd.DataFrame(X_poly, columns=poly.get_feature_names_out())
X_poly_df.head()

| | 1 | density | alcohol | density^2 | density alcohol | alcohol^2 | density^3 | density^2 alcohol | density alcohol^2 | alcohol^3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.001 | 8.8 | 1.002001 | 8.8088 | 77.44 | 1.003003 | 8.817609 | 77.51744 | 681.472 |
| 1 | 1.0 | 0.994 | 9.5 | 0.988036 | 9.443 | 90.25 | 0.982108 | 9.386342 | 89.7085 | 857.375 |
| 2 | 1.0 | 0.9951 | 10.1 | 0.990224 | 10.05051 | 102.01 | 0.985372 | 10.001263 | 101.510151 | 1030.301 |
| 3 | 1.0 | 0.9956 | 9.9 | 0.991219 | 9.85644 | 98.01 | 0.986858 | 9.813072 | 97.578756 | 970.299 |
| 4 | 1.0 | 0.9949 | 9.6 | 0.989826 | 9.55104 | 92.16 | 0.984778 | 9.50233 | 91.689984 | 884.736 |
