# Data normalization

Machine-learning models built with parametric learning algorithms such as linear regression frequently perform better when the feature columns they're trained with have similar ranges. Scikit-learn offers several classes for normalizing data, including [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), which rescales each feature to a specified range (by default, 0 to 1), and [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), which subtracts each feature's mean and divides by its standard deviation so that the feature is centered on 0 with unit variance.

Some learning algorithms are more sensitive to unnormalized data than others. Support vector machines (SVMs) and neural networks in particular tend to be very sensitive to it. Let's demonstrate by comparing logistic-regression and SVM classification models trained with 1) unnormalized data, 2) data normalized with `MinMaxScaler`, and 3) data normalized with `StandardScaler`. The dataset we'll use is the breast-cancer dataset that comes with Scikit-learn.
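Before loading the real data, it may help to see the two transformations written out by hand. The sketch below uses a tiny made-up array (`X` is purely illustrative, not part of the dataset) and checks that the manual formulas agree with what `MinMaxScaler` and `StandardScaler` produce with their default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (illustrative only): two features with very different ranges
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Min-max scaling: (x - min) / (max - min), computed per column
minmax_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: (x - mean) / std, computed per column
# (np.std defaults to the population standard deviation, as StandardScaler uses)
standard_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Compare the hand-computed values with Scikit-learn's scalers
minmax_sklearn = MinMaxScaler().fit_transform(X)
standard_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(minmax_manual, minmax_sklearn))      # True
print(np.allclose(standard_manual, standard_sklearn))  # True
```

After either transformation, both columns hold comparable values even though the raw columns differ by two orders of magnitude.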

In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)
df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


Every row in the dataset represents a tumor, and the "target" column is the label column. 0 means the tumor is malignant, and 1 means it is benign.

In [2]:
# Show the target names
print(data.target_names)

['malignant' 'benign']
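Before scaling anything, it can be useful to see how far apart the feature ranges are in the raw data. The following sketch builds a small summary table of each feature's minimum, maximum, and spread. The `ranges` DataFrame and its `spread` column are names introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data=data.data, columns=data.feature_names)

# Per-feature minimum, maximum, and spread (max - min)
ranges = pd.DataFrame({
    "min": df.min(),
    "max": df.max(),
    "spread": df.max() - df.min(),
})

# Sort so the widest and narrowest features are easy to compare
print(ranges.sort_values("spread", ascending=False))
```

Area-related features span thousands of units while features such as smoothness vary by a fraction of a unit, which is the kind of mismatch that normalization is meant to remove.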


Split the data for training and testing. Then create a version of the dataset that's normalized with `MinMaxScaler`, and another version that's normalized with `StandardScaler`.

In [3]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Split the dataset for training and testing
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=0)

# Scale the training and test data with MinMaxScaler (fit on the training data only)
scaler = MinMaxScaler()
x_train_minmax = scaler.fit_transform(x_train)
x_test_minmax = scaler.transform(x_test)

# Scale the training and test data with StandardScaler (fit on the training data only)
scaler = StandardScaler()
x_train_standard = scaler.fit_transform(x_train)
x_test_standard = scaler.transform(x_test)
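As a sanity check on the scaling step, the sketch below repeats the split and confirms that the scaled training features have the properties each scaler promises. It uses separate variable names (`minmax`, `standard`) for the two fitted scalers, which is a choice made here for clarity rather than something the cell above does. Because the scalers are fitted on the training data only, the transformed test data may fall slightly outside the 0-to-1 range.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

# Fit each scaler on the training data only
minmax = MinMaxScaler().fit(x_train)
standard = StandardScaler().fit(x_train)

x_train_minmax = minmax.transform(x_train)
x_train_standard = standard.transform(x_train)

# MinMaxScaler: every training feature should span exactly 0 to 1
print(x_train_minmax.min(axis=0).min(), x_train_minmax.max(axis=0).max())

# StandardScaler: every training feature should have mean ~0 and std ~1
print(np.abs(x_train_standard.mean(axis=0)).max())
print(x_train_standard.std(axis=0).min(), x_train_standard.std(axis=0).max())

# The test data reuses the training statistics, so its range is not guaranteed
x_test_minmax = minmax.transform(x_test)
print(x_test_minmax.min(), x_test_minmax.max())
```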

## Logistic-regression classifier

Train a logistic-regression model with unnormalized data, `MinMaxScaler`-normalized data, and `StandardScaler`-normalized data and compare the results.

In [4]:
# Unnormalized data
from sklearn.linear_model import LogisticRegressionCV

model = LogisticRegressionCV(max_iter=5000, random_state=0)
model.fit(x_train, y_train)
model.score(x_test, y_test)

0.956140350877193

In [5]:
# Data normalized with MinMaxScaler
model.fit(x_train_minmax, y_train)
model.score(x_test_minmax, y_test)

0.9649122807017544

In [6]:
# Data normalized with StandardScaler
model.fit(x_train_standard, y_train)
model.score(x_test_standard, y_test)

0.9649122807017544
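To put the three scores above in perspective, the short calculation below converts each accuracy back into a count of correctly classified test samples. It assumes the 569-sample dataset and the 20% test split used earlier, with `train_test_split` rounding the test-set size up.

```python
import math

# Test-set size: 20% of 569 samples, rounded up
n_test = math.ceil(0.2 * 569)

# Accuracy scores reported above for the logistic-regression model
scores = {
    "unnormalized": 0.956140350877193,
    "MinMaxScaler": 0.9649122807017544,
    "StandardScaler": 0.9649122807017544,
}

# Convert each accuracy into a number of correct predictions
correct = {name: round(score * n_test) for name, score in scores.items()}

for name, count in correct.items():
    print(f"{name}: {count} of {n_test} correct")
# unnormalized: 109 of 114 correct
# MinMaxScaler: 110 of 114 correct
# StandardScaler: 110 of 114 correct
```

For this model, then, normalization changes the outcome for a single test sample.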

## Support Vector Machine (SVM) classifier

Train a support-vector machine with unnormalized data, `MinMaxScaler`-normalized data, and `StandardScaler`-normalized data and compare the results.

In [7]:
# Unnormalized data
from sklearn.svm import SVC

model = SVC(random_state=0)
model.fit(x_train, y_train)
model.score(x_test, y_test)

0.9298245614035088

In [8]:
# Data normalized with MinMaxScaler
model.fit(x_train_minmax, y_train)
model.score(x_test_minmax, y_test)

0.9736842105263158

In [9]:
# Data normalized with StandardScaler
model.fit(x_train_standard, y_train)
model.score(x_test_standard, y_test)

0.9824561403508771
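The same three-way SVM comparison can also be written more compactly by bundling each scaler with the model. The sketch below uses Scikit-learn's `make_pipeline` for this, which is an alternative formulation rather than the approach taken in the cells above. The `candidates` and `scores` dictionaries are names introduced here for illustration. A pipeline fits its scaler on the training data and reuses those statistics when scoring, so there is no need to keep separate scaled copies of the dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

# One entry per preprocessing choice; each pipeline scales, then classifies
candidates = {
    "unnormalized": SVC(random_state=0),
    "MinMaxScaler": make_pipeline(MinMaxScaler(), SVC(random_state=0)),
    "StandardScaler": make_pipeline(StandardScaler(), SVC(random_state=0)),
}

# Train each candidate on the raw training data and score it on the raw test data
scores = {}
for name, model in candidates.items():
    model.fit(x_train, y_train)
    scores[name] = model.score(x_test, y_test)
    print(f"{name}: {scores[name]:.4f}")
```

With the same split and default settings, the scores should line up with the ones shown above, though exact values can vary across Scikit-learn versions.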