# Normalizer

## Description from the scikit-learn documentation:

Normalize samples individually to unit norm.

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one.
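As a small illustration of the paragraph above, the sketch below normalizes a toy matrix with both norms. The array values are made up for this example, and the second row is deliberately all zeros to show that such rows are left untouched.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Toy data (illustrative values only): one ordinary row and one all-zero row
rows = np.array([[3.0, 4.0],
                 [0.0, 0.0]])

# l2: divide each row by the square root of the sum of its squared elements
l2_rows = Normalizer(norm='l2').fit_transform(rows)

# l1: divide each row by the sum of the absolute values of its elements
l1_rows = Normalizer(norm='l1').fit_transform(rows)

print(l2_rows)  # first row becomes [0.6, 0.8]
print(l1_rows)  # first row becomes [3/7, 4/7]
```

In both cases the all-zero row stays all zeros, since it has no non-zero component to rescale.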

This transformer is able to work both with dense numpy arrays and scipy.sparse matrix (use CSR format if you want to avoid the burden of a copy / conversion).

Scaling inputs to unit norms is a common operation in text classification or clustering. For instance, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors, which is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.
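The cosine-similarity claim can be checked numerically. This is a minimal sketch with two arbitrary vectors (not real TF-IDF vectors): it compares the dot product of the l2-normalized rows with the cosine similarity computed from its definition.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two arbitrary example vectors (illustrative values, not real TF-IDF data)
vectors = np.array([[1.0, 2.0, 3.0],
                    [4.0, 0.0, 1.0]])

# l2-normalize each row independently
unit = Normalizer(norm='l2').fit_transform(vectors)

# Dot product of the two normalized rows
dot_of_normalized = unit[0] @ unit[1]

# Cosine similarity computed directly from the definition
a, b = vectors
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(dot_of_normalized, cosine)  # the two numbers should agree
```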

## The true meaning of Mr. N is at the end of the page :]

### Imports

In [109]:
import pandas as pd

from sklearn.preprocessing import Normalizer
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

from sklearn.linear_model import LogisticRegression

### Function for easy display

In [110]:
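from IPython.display import display  # available by default in notebooks; imported so this also runs as a script


def dfinfo(df):
    """Show the first 10 rows and the summary statistics of a DataFrame."""
    print('======\n= HEAD\n======')
    display(df.head(10))

    print('======\n= DESC\n======')
    display(df.describe())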

### Load data

In [111]:
data = load_breast_cancer()

X = pd.DataFrame(data.data)
X.columns = data.feature_names.tolist()

Y = pd.DataFrame(data.target)
Y.columns = ['result']

### Info about X

In [112]:
dfinfo(X)

= HEAD


row,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
5,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,0.07613,...,15.47,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244
6,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,0.1794,0.05742,...,22.88,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368
7,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,0.2196,0.07451,...,17.06,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151
8,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,0.235,0.07389,...,15.49,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072
9,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,0.08243,...,15.09,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075


= DESC


statistic,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


### Splitting into train and test sets

In [113]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)

### LogisticRegression time (raw data)

In [114]:
lor = LogisticRegression()
lor.fit(X_train, y_train.values.ravel())

print('Train score:\t{:.3f}'.format(lor.score(X_train, y_train)))
print('Test score:\t{:.3f}'.format(lor.score(X_test, y_test)))

Train score:	0.960
Test score:	0.958


### Normalizing

In [115]:
N = Normalizer().fit(X)
X_n = pd.DataFrame(N.transform(X))
X_n.columns = X.columns

dfinfo(X_n)

= HEAD


row,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,0.007925,0.004573,0.054099,0.440986,5.2e-05,0.000122,0.000132,6.5e-05,0.000107,3.5e-05,...,0.011181,0.007635,0.081325,0.889462,7.1e-05,0.000293,0.000314,0.000117,0.000203,5.2e-05
1,0.008666,0.007486,0.055988,0.558619,3.6e-05,3.3e-05,3.7e-05,3e-05,7.6e-05,2.4e-05,...,0.010528,0.009862,0.066899,0.824026,5.2e-05,7.9e-05,0.000102,7.8e-05,0.000116,3.8e-05
2,0.009367,0.010109,0.061842,0.572276,5.2e-05,7.6e-05,9.4e-05,6.1e-05,9.8e-05,2.9e-05,...,0.011212,0.012145,0.072545,0.812984,6.9e-05,0.000202,0.000214,0.000116,0.000172,4.2e-05
3,0.016325,0.029133,0.110899,0.551922,0.000204,0.000406,0.000345,0.00015,0.000371,0.000139,...,0.021314,0.037881,0.141333,0.811515,0.0003,0.001238,0.000982,0.000368,0.000949,0.000247
4,0.009883,0.006985,0.065808,0.631774,4.9e-05,6.5e-05,9.6e-05,5.1e-05,8.8e-05,2.9e-05,...,0.010979,0.00812,0.074137,0.767189,6.7e-05,0.0001,0.000195,7.9e-05,0.000115,3.7e-05
5,0.013945,0.017586,0.092486,0.534398,0.000143,0.00019,0.000177,9.1e-05,0.000234,8.5e-05,...,0.017328,0.026602,0.115818,0.830664,0.000201,0.000588,0.0006,0.000195,0.000446,0.000139
6,0.009483,0.010382,0.062147,0.540411,4.9e-05,5.7e-05,5.9e-05,3.8e-05,9.3e-05,3e-05,...,0.011889,0.014373,0.079607,0.83452,7.5e-05,0.000134,0.000197,0.0001,0.000159,4.3e-05
7,0.012712,0.019313,0.083631,0.535813,0.00011,0.000153,8.7e-05,5.5e-05,0.000204,6.9e-05,...,0.015818,0.026091,0.102545,0.831674,0.000153,0.000341,0.000248,0.000144,0.000296,0.000107
8,0.0142,0.023834,0.095577,0.567784,0.000139,0.000211,0.000203,0.000102,0.000257,8.1e-05,...,0.01692,0.033567,0.116004,0.807547,0.000186,0.00059,0.000589,0.000225,0.000478,0.000117
9,0.014365,0.027716,0.096808,0.548661,0.000137,0.000276,0.000262,9.8e-05,0.000234,9.5e-05,...,0.017397,0.0469,0.11258,0.820167,0.000214,0.00122,0.001274,0.000255,0.000503,0.000239


= DESC


statistic,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,0.014843,0.02201,0.096006,0.605084,0.000113,0.00011,7.9e-05,4.2e-05,0.000212,7.6e-05,...,0.016745,0.029162,0.109652,0.777287,0.000155,0.000262,0.000261,0.000108,0.000335,9.9e-05
std,0.004011,0.011077,0.024912,0.047241,5.9e-05,6.5e-05,7.3e-05,2.3e-05,0.000111,4.3e-05,...,0.004009,0.014452,0.025127,0.037726,8.3e-05,0.000175,0.000236,5.3e-05,0.00017,5.6e-05
min,0.005512,0.004568,0.036396,0.376233,2.2e-05,2.4e-05,0.0,0.0,4.1e-05,1.1e-05,...,0.007245,0.005154,0.050496,0.696047,2.7e-05,4.2e-05,0.0,0.0,4.6e-05,1.5e-05
25%,0.01169,0.014104,0.076533,0.582554,7.2e-05,6.8e-05,3.9e-05,2.7e-05,0.000137,4.5e-05,...,0.013843,0.018349,0.092567,0.749906,9.9e-05,0.000148,0.000139,7.7e-05,0.000223,6.1e-05
50%,0.015061,0.020352,0.097151,0.615219,0.000104,9.4e-05,6.4e-05,3.8e-05,0.000194,6.9e-05,...,0.016722,0.027027,0.10976,0.768914,0.000142,0.000222,0.000213,0.000102,0.000308,8.9e-05
75%,0.017354,0.026888,0.111015,0.639475,0.000141,0.000133,0.000103,5.4e-05,0.000264,9.4e-05,...,0.019225,0.036616,0.12556,0.799809,0.00019,0.000322,0.00032,0.000133,0.000422,0.00012
max,0.02847,0.086609,0.178585,0.697401,0.000477,0.000557,0.000826,0.000161,0.000787,0.000319,...,0.03234,0.108567,0.205583,0.921243,0.000646,0.001474,0.002974,0.000416,0.001196,0.000445
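The tables above show that the values shrank, but not what was actually normalized. A quick check, sketched here on the same dataset (the variable names `X_raw` and `X_unit` are chosen just for this example), is to compute the l2 norm along each axis: the rows should all have norm 1, while the columns generally do not.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import Normalizer

# Same dataset as above, kept as a plain NumPy array for this check
X_raw = load_breast_cancer().data
X_unit = Normalizer().fit_transform(X_raw)

# l2 norm of every row (sample) and of every column (feature)
row_norms = np.linalg.norm(X_unit, axis=1)
col_norms = np.linalg.norm(X_unit, axis=0)

print(row_norms.min(), row_norms.max())  # both should be (numerically) 1
print(col_norms[:5])                     # columns are not rescaled to unit norm
```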


### Splitting the normalized data

In [116]:
X_train_n, X_test_n, y_train, y_test = train_test_split(X_n, Y, random_state=0)

### Result (worse than on the raw data)

In [117]:
lor = LogisticRegression()
lor.fit(X_train_n, y_train.values.ravel())

print('Train score:\t{:.3f}'.format(lor.score(X_train_n, y_train)))
print('Test score:\t{:.3f}'.format(lor.score(X_test_n, y_test)))

Train score:	0.772
Test score:	0.762
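One plausible explanation for the drop (an interpretation, not something measured in the cells above) is that the area features are orders of magnitude larger than the rest, so after row-wise normalization they account for almost all of each row's unit length and the remaining features are squeezed toward zero. The sketch below estimates that share; the names `area_cols` and `area_share` are introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import Normalizer

data = load_breast_cancer()
names = list(data.feature_names)
X_unit = Normalizer().fit_transform(data.data)

# Indices of every feature whose name mentions "area"
area_cols = [i for i, name in enumerate(names) if 'area' in name]

# Each row has squared l2 norm 1, so the sum of squares over the area
# columns is the fraction of the row's squared norm carried by them
area_share = (X_unit[:, area_cols] ** 2).sum(axis=1)

print([names[i] for i in area_cols])
print(area_share.mean())
```

If the printed share is close to 1, the classifier is effectively seeing little more than the ratios between the area features, which may be why it scores lower here.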


## The true meaning of Normalizer

### For each row (with the default l2 norm)

# $$\sum_i {X'_i}^2 = 1$$
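
Assuming the default `norm='l2'`, this identity follows directly from how each element is rescaled. Writing $X_i$ for the original elements of one row and $X'_i$ for the normalized ones:

$$X'_i = \frac{X_i}{\sqrt{\sum_j X_j^2}} \quad\Longrightarrow\quad \sum_i {X'_i}^2 = \frac{\sum_i X_i^2}{\sum_j X_j^2} = 1$$

This holds only for rows with at least one non-zero element; an all-zero row is left unchanged.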

### Example

In [118]:
ex = [[5, 0, 5, 0]]
ex_n = Normalizer().fit_transform(ex)
print(ex_n)

[[0.70710678 0.         0.70710678 0.        ]]


### We have the array
# $$ [5; 0; 5; 0] $$

### Normalized array

# $$ [0.70710678; 0; 0.70710678; 0] $$

### Let's square each element and sum

# $$0.70710678^2 + 0^2 + 0.70710678^2 + 0^2 \approx 1$$
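
The same result can be reproduced by hand. This is a minimal NumPy sketch of l2 row normalization, not scikit-learn's actual implementation: it divides the example row by the square root of the sum of its squares and compares the result with `Normalizer`.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

ex = np.array([[5.0, 0.0, 5.0, 0.0]])

# l2 norm of each row: sqrt(5^2 + 0^2 + 5^2 + 0^2) = sqrt(50)
norms = np.sqrt((ex ** 2).sum(axis=1, keepdims=True))

# Divide every element by the norm of its own row
ex_manual = ex / norms

ex_sklearn = Normalizer().fit_transform(ex)

print(ex_manual)
print(np.allclose(ex_manual, ex_sklearn))
```

Note that this hand-written version would divide by zero for an all-zero row, a case that `Normalizer` handles by leaving the row unchanged.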

# Have a good time!

![giphy](giphy.gif)