## Daniel Barella
## 9/19/25

# 📘 Day 20: Feature Scaling + Mini Project

## 🧠 Concepts Learned

- Feature scaling puts all features on comparable numeric ranges, so that no feature dominates a model simply because of its units or magnitude.

- Two main approaches:

    - Normalization (min-max scaling): rescale each feature to the range 0–1.

    - Standardization (z-score scaling): rescale each feature to mean = 0 and std = 1.

- Many ML methods perform better on scaled data: distance-based models (k-NN, SVM), PCA, and models trained with gradient descent.
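As a minimal sketch of the two formulas, normalization computes `(x - min) / (max - min)` and standardization computes `(x - mean) / std`. The sample values below are made up for illustration and are not taken from the wine data.

```python
import numpy as np

# Hypothetical feature values, for illustration only
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Normalization (min-max): maps the smallest value to 0 and the largest to 1
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation.
# np.std defaults to the population std (ddof=0), which is also what StandardScaler uses.
x_std = (x - x.mean()) / x.std()

print(x_norm)                     # values now run from 0 to 1
print(x_std.mean(), x_std.std())  # approximately 0 and 1
```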

## 🛠️ Practice Work

### Load the Data

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Load cleaned wine dataset
wine = pd.read_csv("wine_clean.csv")
wine.head()


| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5.0 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5.0 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5.0 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6.0 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5.0 |


### Apply Normalization

In [2]:
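# Scale every feature column to [0, 1]; the target ("quality") is left untouched
features = wine.drop("quality", axis=1)
scaler = MinMaxScaler()
normalized = scaler.fit_transform(features)

# fit_transform returns a NumPy array, so rebuild the DataFrame with the feature names
normalized_df = pd.DataFrame(normalized, columns=features.columns, index=wine.index)
normalized_df["quality"] = wine["quality"]

normalized_df.head()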


| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.247788 | 0.39726 | 0.0 | 0.068493 | 0.106845 | 0.140845 | 0.09894 | 0.567548 | 0.606299 | 0.137725 | 0.153846 | 5.0 |
| 1 | 0.283186 | 0.520548 | 0.0 | 0.116438 | 0.143573 | 0.338028 | 0.215548 | 0.494126 | 0.362205 | 0.209581 | 0.215385 | 5.0 |
| 2 | 0.283186 | 0.438356 | 0.04 | 0.09589 | 0.133556 | 0.197183 | 0.169611 | 0.508811 | 0.409449 | 0.191617 | 0.215385 | 5.0 |
| 3 | 0.584071 | 0.109589 | 0.56 | 0.068493 | 0.105175 | 0.225352 | 0.190813 | 0.582232 | 0.330709 | 0.149701 | 0.215385 | 6.0 |
| 4 | 0.247788 | 0.39726 | 0.0 | 0.068493 | 0.106845 | 0.140845 | 0.09894 | 0.567548 | 0.606299 | 0.137725 | 0.153846 | 5.0 |
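A quick sanity check for min-max scaling is that every scaled column spans 0 to 1 and that the fitted scaler can map the values back to their original units. The sketch below uses only four rows of two columns copied from the `wine.head()` output, so its minimum and maximum belong to this tiny sample rather than to the full dataset, and its scaled values differ from the table above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Four rows copied from the wine.head() output; the full dataset is not used here
sample = pd.DataFrame({
    "total_sulfur_dioxide": [34.0, 67.0, 54.0, 60.0],
    "alcohol": [9.4, 9.8, 9.8, 9.8],
})

scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(sample), columns=sample.columns)

# Every column should now run from (approximately) 0 to 1
print(scaled.min().tolist(), scaled.max().tolist())

# The fitted scaler stores each column's min and max, so the mapping can be undone
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, sample))
```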


### Apply Standardization

In [3]:
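# Rescale every feature column to mean 0 and std 1; the target ("quality") is left untouched
features = wine.drop("quality", axis=1)
scaler = StandardScaler()
standardized = scaler.fit_transform(features)

# fit_transform returns a NumPy array, so rebuild the DataFrame with the feature names
standardized_df = pd.DataFrame(standardized, columns=features.columns, index=wine.index)
standardized_df["quality"] = wine["quality"]

standardized_df.head()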


| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.52836 | 0.961877 | -1.391472 | -0.453218 | -0.243707 | -0.466193 | -0.379133 | 0.558274 | 1.288643 | -0.579207 | -0.960246 | 5.0 |
| 1 | -0.298547 | 1.967442 | -1.391472 | 0.043416 | 0.223875 | 0.872638 | 0.624363 | 0.028261 | -0.719933 | 0.12895 | -0.584777 | 5.0 |
| 2 | -0.298547 | 1.297065 | -1.18607 | -0.169427 | 0.096353 | -0.083669 | 0.229047 | 0.134264 | -0.331177 | -0.048089 | -0.584777 | 5.0 |
| 3 | 1.654856 | -1.384443 | 1.484154 | -0.453218 | -0.26496 | 0.107592 | 0.4115 | 0.664277 | -0.979104 | -0.46118 | -0.584777 | 6.0 |
| 4 | -0.52836 | 0.961877 | -1.391472 | -0.453218 | -0.243707 | -0.466193 | -0.379133 | 0.558274 | 1.288643 | -0.579207 | -0.960246 | 5.0 |
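To check a standardized table, look at the column means and standard deviations. One detail that can be surprising: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `.std()` defaults to the sample standard deviation (`ddof=1`), so the pandas default does not return exactly 1. The sketch below uses four rows of two columns copied from the `wine.head()` output, not the full dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Four rows copied from the wine.head() output; the full dataset is not used here
sample = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8, 7.8, 11.2],
    "pH": [3.51, 3.20, 3.26, 3.16],
})

scaled = pd.DataFrame(StandardScaler().fit_transform(sample), columns=sample.columns)

print(scaled.mean())       # approximately 0 for every column
print(scaled.std(ddof=0))  # approximately 1: the population std that StandardScaler uses
print(scaled.std())        # pandas default (ddof=1): sqrt(n / (n - 1)), above 1
```

With only four rows the gap is large; on a dataset with many rows the two conventions differ only in the later decimal places.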


## 🎯 Mini Project: Predict Wine Quality with k-NN

### Objective: Compare performance with and without feature scaling.

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Train/test split on the raw (unscaled) features
X = wine.drop("quality", axis=1)
y = wine["quality"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline (no scaling)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, knn.predict(X_test))

# With standardization: the same random_state reproduces the same row split,
# so both models are scored on the same test wines.
# Caveat: standardized_df was scaled using all rows, so test-set statistics leak into the scaling.
X_scaled = standardized_df.drop("quality", axis=1)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

knn.fit(X_train, y_train)  # refit the same estimator on the scaled training data
scaled_acc = accuracy_score(y_test, knn.predict(X_test))

baseline_acc, scaled_acc


(0.45625, 0.553125)
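One refinement worth considering: the scaled run uses `standardized_df`, whose scaler was fit on every row before the split. A common way to keep test rows out of the scaling statistics is to put the scaler and the classifier in a scikit-learn pipeline, so the scaler is fit on the training split only. The sketch below assumes synthetic data from `make_classification` as a stand-in (it is not the wine dataset), so its accuracy is not comparable to the numbers above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch, not the wine dataset)
X, y = make_classification(n_samples=500, n_features=11, n_informative=5, random_state=42)

# Split first, so the test rows play no part in fitting the scaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit() fits the scaler on X_train only, then trains k-NN on the scaled training data
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# score() scales X_test with the training mean and std before predicting
pipeline_acc = model.score(X_test, y_test)
print(pipeline_acc)

# The scaler's statistics come from the training split alone
print(np.allclose(model.named_steps["standardscaler"].mean_, X_train.mean(axis=0)))
```

Applied to the wine data, the same pattern would replace the precomputed `standardized_df` with `model.fit(X_train, y_train)` on the raw split; the resulting accuracy may differ slightly from the figure reported here.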

### Outcome:

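Test accuracy rises from about 0.456 to 0.553 after standardization. k-NN classifies by the distance between rows, so on the raw data the features with the largest numeric ranges (for example `total_sulfur_dioxide`) dominate the distance; standardization puts all features on a comparable scale.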
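To see why the raw distance is lopsided, the sketch below takes rows 0 and 1 of the unscaled `wine.head()` output and splits their squared Euclidean distance into per-feature contributions. Only these two rows are used, so this is an illustration rather than a statistic of the whole dataset.

```python
import numpy as np

features = ["fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar",
            "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide", "density",
            "pH", "sulphates", "alcohol"]

# Rows 0 and 1 of the unscaled wine.head() output
row0 = np.array([7.4, 0.70, 0.0, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4])
row1 = np.array([7.8, 0.88, 0.0, 2.6, 0.098, 25.0, 67.0, 0.9968, 3.20, 0.68, 9.8])

# Squared Euclidean distance is the sum of per-feature squared differences
sq_diff = (row0 - row1) ** 2
share = sq_diff / sq_diff.sum()

# Print each feature's share of the squared distance, largest first
for i in np.argsort(share)[::-1]:
    print(f"{features[i]:<22}{share[i]:.4%}")

# Combined share of the two sulfur dioxide columns (indices 5 and 6)
sulfur_share = share[5] + share[6]
print(f"sulfur dioxide columns together: {sulfur_share:.2%}")
```

For this pair of rows, the two sulfur dioxide columns account for more than 99.9% of the squared distance, so the other nine features barely influence which neighbors k-NN picks.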