## Feature Scaling

- A technique for rescaling the independent features in a dataset to a fixed, comparable range.
- Performed during data preprocessing to handle features whose magnitudes, value ranges, or units vary widely.
- Without feature scaling, a machine learning model tends to give more weight to features with larger values and less weight to features with smaller values, regardless of the units involved.

**Why use Feature Scaling**
1. Scaling puts all features on a comparable scale with comparable ranges.
2. Several machine learning methods, including gradient descent-based algorithms, distance-based algorithms (such as k-nearest neighbours), and support vector machines, perform better or converge more quickly when the features are scaled.
3. Scaling helps prevent numerical instability.
4. Scaling ensures that each feature is given the same consideration during the learning process.
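
To make the point about distance-based algorithms concrete, here is a minimal sketch using made-up numbers (they are illustrative assumptions, not rows from the dataset used later). It compares Euclidean distances before and after min-max scaling, the normalization technique described in the next section.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales,
# e.g. at-bats in the thousands and a batting average below 1.
X = np.array([
    [11000.0, 0.366],
    [10900.0, 0.250],
    [ 9000.0, 0.365],
])

def euclidean(a, b):
    """Plain Euclidean distance between two 1-D vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Unscaled: the large-magnitude feature alone decides which row is closest to row 0.
raw_d01 = euclidean(X[0], X[1])
raw_d02 = euclidean(X[0], X[2])

# Min-max scale each column to [0, 1], then measure again.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled_d01 = euclidean(X_scaled[0], X_scaled[1])
scaled_d02 = euclidean(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={raw_d01:.3f}  d(0,2)={raw_d02:.3f}")
print(f"scaled: d(0,1)={scaled_d01:.3f}  d(0,2)={scaled_d02:.3f}")
```

In the raw data, row 1 looks far closer to row 0 than row 2 does, purely because of the first feature. After scaling, the large gap in the second feature counts as much as the large gap in the first, and the two distances come out nearly equal.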
    

**Normalization**

- A data preparation technique in ML that rescales the numeric columns in a dataset to a common scale, typically the range 0 to 1.
- This helps models train more effectively by making features more comparable. Because the minimum and maximum of each column define the new range, min-max scaling is itself sensitive to outliers.

<div style="display: flex; justify-content: center;">
    <img src="https://www.geeksforgeeks.org/wp-content/ql-cache/quicklatex.com-a402f8bbcfa52ab9c5e772c51376f11f_l3.svg" alt="Norm" />
</div>
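
As a rough illustration, assuming the usual min-max form `x' = (x - min) / (max - min)`, normalization can be computed by hand with NumPy. The values below are hypothetical and stand in for a single numeric column.

```python
import numpy as np

# Hypothetical values for one numeric column (illustrative only).
x = np.array([117.0, 475.0, 260.0, 101.0])

# Min-max normalization: subtract the column minimum, divide by the column range.
x_min, x_max = x.min(), x.max()
x_norm = (x - x_min) / (x_max - x_min)

print(x_norm.round(3))
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.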

**Standardization**

- A data preprocessing technique in machine learning that rescales data to have a mean of 0 and a standard deviation of 1.
- Also known as z-score scaling.

<div style="display: flex; justify-content: center;">
    <img src="https://www.geeksforgeeks.org/wp-content/ql-cache/quicklatex.com-c355c38ab7c065e03d0e6f6405da2862_l3.svg" alt="Standard" />
</div>
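
A minimal sketch of the same idea by hand, assuming the usual z-score form `z = (x - mean) / std`. The values are hypothetical, and the population standard deviation (`ddof=0`) is used because that is the convention `StandardScaler` follows.

```python
import numpy as np

# Hypothetical values for one numeric column (illustrative only).
x = np.array([24.0, 22.0, 22.0, 20.0, 21.0])

mu = x.mean()
sigma = x.std()  # NumPy defaults to the population std (ddof=0)

# Z-score: centre on the mean, then divide by the standard deviation.
z = (x - mu) / sigma

print("mean:", round(float(z.mean()), 6), "std:", round(float(z.std()), 6))
```

Unlike min-max scaling, the result is not bounded to a fixed interval: values more than one standard deviation from the mean fall outside [-1, 1].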

In [13]:
# Importing Libraries

import pandas as pd

df = pd.read_csv('500hits.csv', encoding = "latin-1")
df.head()

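|  | PLAYER | YRS | G | AB | R | H | 2B | 3B | HR | RBI | BB | SO | SB | CS | BA | HOF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ty Cobb | 24 | 3035 | 11434 | 2246 | 4189 | 724 | 295 | 117 | 726 | 1249 | 357 | 892 | 178 | 0.366 | 1 |
| 1 | Stan Musial | 22 | 3026 | 10972 | 1949 | 3630 | 725 | 177 | 475 | 1951 | 1599 | 696 | 78 | 31 | 0.331 | 1 |
| 2 | Tris Speaker | 22 | 2789 | 10195 | 1882 | 3514 | 792 | 222 | 117 | 724 | 1381 | 220 | 432 | 129 | 0.345 | 1 |
| 3 | Derek Jeter | 20 | 2747 | 11195 | 1923 | 3465 | 544 | 66 | 260 | 1311 | 1082 | 1840 | 358 | 97 | 0.31 | 1 |
| 4 | Honus Wagner | 21 | 2792 | 10430 | 1736 | 3430 | 640 | 252 | 101 | 0 | 963 | 327 | 722 | 15 | 0.329 | 1 |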


In [14]:
df.drop(columns = ["PLAYER", "CS"], inplace = True)

In [16]:
df.columns

Index(['YRS', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO', 'SB',
       'BA', 'HOF'],
      dtype='object')

In [8]:
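# select the 13 feature columns (everything except the HOF target);
# X1 and X2 start out identical, one copy for each scaling example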
X1 = df.iloc[:, 0:13]
X2 = df.iloc[:, 0:13]

In [18]:
# Example 1: Standardization - rescale each feature to mean 0 and std 1

# importing library
from sklearn.preprocessing import StandardScaler


X1 = StandardScaler().fit_transform(X1)

# turning X1 back into df
X1 = pd.DataFrame(X1, columns = ['YRS', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO', 'SB',
       'BA'])
X1.head()

|  | YRS | G | AB | R | H | 2B | 3B | HR | RBI | BB | SO | SB | BA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.516295 | 2.786078 | 3.034442 | 3.787062 | 4.764193 | 3.559333 | 4.389485 | -0.585841 | -0.346449 | 1.423013 | -1.003628 | 3.832067 | 3.64829 |
| 1 | 1.792237 | 2.760655 | 2.677044 | 2.76053 | 3.444971 | 3.569709 | 1.996457 | 1.909487 | 2.175837 | 2.493089 | -0.309948 | -0.64908 | 1.996159 |
| 2 | 1.792237 | 2.091184 | 2.075964 | 2.528955 | 3.171214 | 4.264876 | 2.909053 | -0.585841 | -0.350567 | 1.826585 | -1.283965 | 1.299723 | 2.657012 |
| 3 | 1.06818 | 1.972543 | 2.849554 | 2.670665 | 3.055576 | 1.691719 | -0.254611 | 0.410896 | 0.858071 | 0.912434 | 2.030966 | 0.892346 | 1.004881 |
| 4 | 1.430208 | 2.099658 | 2.257758 | 2.024329 | 2.972977 | 2.68778 | 3.517449 | -0.697364 | -1.84129 | 0.548609 | -1.065016 | 2.896201 | 1.901752 |


In [21]:
X1.describe().round(3)

# note: each mean is now 0 and each std is approximately 1, so the features are standardized

|  | YRS | G | AB | R | H | 2B | 3B | HR | RBI | BB | SO | SB | BA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 |
| mean | 0.0 | -0.0 | -0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.0 | 0.0 | 0.0 | -0.0 | 0.0 | -0.0 |
| std | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 |
| min | -2.19 | -2.027 | -1.958 | -1.899 | -1.204 | -2.116 | -1.532 | -1.339 | -1.841 | -1.665 | -1.734 | -1.04 | -2.016 |
| 25% | -0.742 | -0.697 | -0.765 | -0.741 | -0.784 | -0.715 | -0.762 | -0.851 | -0.524 | -0.76 | -0.842 | -0.732 | -0.742 |
| 50% | -0.018 | -0.157 | -0.209 | -0.16 | -0.222 | -0.155 | -0.234 | -0.161 | 0.152 | -0.145 | -0.046 | -0.324 | -0.081 |
| 75% | 0.706 | 0.56 | 0.517 | 0.504 | 0.483 | 0.571 | 0.577 | 0.634 | 0.642 | 0.524 | 0.775 | 0.49 | 0.533 |
| max | 3.24 | 3.557 | 3.754 | 3.956 | 4.764 | 4.265 | 4.673 | 3.861 | 2.888 | 4.3 | 3.58 | 6.662 | 3.648 |
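
The reported std of 1.001 rather than exactly 1 is most likely a degrees-of-freedom effect: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `describe()` reports the sample standard deviation (`ddof=1`). The sketch below reproduces this with a synthetic column of 465 values (randomly generated as an assumption, not taken from `500hits.csv`).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 465  # same row count as in the describe() output
x = rng.normal(loc=50.0, scale=10.0, size=n)  # synthetic stand-in column

# Standardize with the population std, as StandardScaler does.
z = (x - x.mean()) / x.std(ddof=0)

pop_std = z.std(ddof=0)     # 1 up to floating-point error
sample_std = z.std(ddof=1)  # the convention pandas describe() uses

print(round(float(pop_std), 3), round(float(sample_std), 3))
print(round(float(np.sqrt(n / (n - 1))), 3))  # ratio between the two conventions
```

The ratio between the two conventions is `sqrt(n / (n - 1))`, which for 465 rows rounds to 1.001 regardless of the underlying data.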


In [22]:
# Example 2: Normalization (min-max scaling) - rescale each feature to the range 0 to 1

# importing library
from sklearn.preprocessing import MinMaxScaler

scaleMinMax = MinMaxScaler(feature_range = (0,1))

X2 = scaleMinMax.fit_transform(X2)
# turning X2 back into df
X2 = pd.DataFrame(X2, columns = ['YRS', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO', 'SB',
       'BA'])
X2.head()

|  | YRS | G | AB | R | H | 2B | 3B | HR | RBI | BB | SO | SB | BA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.866667 | 0.861912 | 0.874035 | 0.971074 | 1.0 | 0.889431 | 0.954248 | 0.144772 | 0.316064 | 0.517683 | 0.137466 | 0.632595 | 1.0 |
| 1 | 0.733333 | 0.85736 | 0.811459 | 0.79575 | 0.778964 | 0.891057 | 0.568627 | 0.624665 | 0.849369 | 0.697078 | 0.268002 | 0.050751 | 0.708333 |
| 2 | 0.733333 | 0.737481 | 0.706217 | 0.756198 | 0.733096 | 1.0 | 0.715686 | 0.144772 | 0.315194 | 0.585341 | 0.084713 | 0.303788 | 0.825 |
| 3 | 0.6 | 0.716237 | 0.841663 | 0.780401 | 0.713721 | 0.596748 | 0.205882 | 0.336461 | 0.570744 | 0.432086 | 0.70851 | 0.250893 | 0.533333 |
| 4 | 0.666667 | 0.738998 | 0.738047 | 0.670012 | 0.699881 | 0.752846 | 0.813725 | 0.123324 | 0.0 | 0.371092 | 0.125915 | 0.511079 | 0.691667 |


In [24]:
X2.describe().round(3)

# note: every feature now ranges from 0 to 1 (count is the number of rows, not a scaled value)

|  | YRS | G | AB | R | H | 2B | 3B | HR | RBI | BB | SO | SB | BA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 | 465.0 |
| mean | 0.403 | 0.363 | 0.343 | 0.324 | 0.202 | 0.332 | 0.247 | 0.257 | 0.389 | 0.279 | 0.326 | 0.135 | 0.356 |
| std | 0.184 | 0.179 | 0.175 | 0.171 | 0.168 | 0.157 | 0.161 | 0.193 | 0.212 | 0.168 | 0.188 | 0.13 | 0.177 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.267 | 0.238 | 0.209 | 0.198 | 0.07 | 0.22 | 0.124 | 0.094 | 0.279 | 0.152 | 0.168 | 0.04 | 0.225 |
| 50% | 0.4 | 0.335 | 0.306 | 0.297 | 0.164 | 0.307 | 0.209 | 0.227 | 0.421 | 0.255 | 0.318 | 0.093 | 0.342 |
| 75% | 0.533 | 0.463 | 0.433 | 0.41 | 0.283 | 0.421 | 0.34 | 0.379 | 0.525 | 0.367 | 0.472 | 0.199 | 0.45 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
