# Prepare your data with scikit-learn

Data preprocessing is a data mining technique that transforms raw data into a format that can be analyzed. Real-world data is often incomplete, inconsistent, lacking in certain behaviors or trends, and likely to contain errors. Preprocessing is a well-established way of resolving such issues.

Here we will look at four methods of data transformation:
1) Min-max normalization
2) Standardization
3) Normalization
4) Binarization

In [42]:
pwd

'C:\\Users\\Karthik\\Pictures\\LinkedIn tasks\\Datasets'

In [43]:
cd "C:\Users\Karthik\Pictures\LinkedIn tasks\Datasets"

C:\Users\Karthik\Pictures\LinkedIn tasks\Datasets


In [44]:
import pandas as pd
df = pd.read_csv("winequalityN.csv")

In [45]:
df.head()

Unnamed: 0,type1,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,white,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,white,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


## 1. Min-Max Normalization [Rescale data]

Min-max normalization linearly transforms each value x to y = (x - min) / (max - min), where min and max are the minimum and maximum values in X, the set of observed values of x.

When x = min, y = 0, and when x = max, y = 1.
The minimum value in X is therefore mapped to 0 and the maximum value to 1, so the entire range of X from min to max is mapped to the range 0 to 1.
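
The formula can be checked by hand on a few values. The sketch below is plain Python rather than scikit-learn; the helper name `min_max_scale` is made up for illustration, and the sample values are the residual sugar entries of the first four rows shown in `df.head()`. Because min and max are taken from these four values only, the results differ from those computed over the full column.

```python
# Residual sugar values from the first four rows of df.head()
residual_sugar = [20.7, 1.6, 6.9, 8.5]

def min_max_scale(values):
    """Hypothetical helper: apply y = (x - min) / (max - min) to a list."""
    lo, hi = min(values), max(values)
    # Assumes hi != lo; a constant column would divide by zero here.
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale(residual_sugar)
# The largest value (20.7) maps to 1 and the smallest (1.6) maps to 0.
print([round(v, 3) for v in scaled])
```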

In [46]:
# Rescale data (between 0 and 1)
import numpy
from sklearn.preprocessing import MinMaxScaler
array = df.values
# separate array into input and output components
X = array[:,1:12]
Y = array[:,12]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[0.264 0.127 0.217 0.308 0.06  0.153 0.378 0.268 0.217 0.129 0.116]
 [0.207 0.147 0.205 0.015 0.066 0.045 0.29  0.133 0.45  0.152 0.217]
 [0.355 0.133 0.241 0.097 0.068 0.101 0.21  0.154 0.419 0.124 0.304]
 [0.281 0.1   0.193 0.121 0.081 0.16  0.415 0.164 0.364 0.101 0.275]
 [0.281 0.1   0.193 0.121 0.081 0.16  0.415 0.164 0.364 0.101 0.275]]




The output above shows that every value now lies in the range 0 to 1, which puts all attributes on a common scale. A different target range can be specified through the `feature_range` parameter.
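
To illustrate a custom range, here is a minimal plain-Python sketch of the arithmetic that a target range (a, b) implies: y = a + (x - min) * (b - a) / (max - min). The helper name `rescale` and the choice of the range -1 to 1 are assumptions made for this example; the sample values are the alcohol entries of the first four rows.

```python
# Alcohol values from the first four rows of df.head()
alcohol = [8.8, 9.5, 10.1, 9.9]

def rescale(values, new_min, new_max):
    """Hypothetical helper: map values linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = new_max - new_min
    # Assumes hi != lo, as in the 0-to-1 case.
    return [new_min + (v - lo) * span / (hi - lo) for v in values]

scaled = rescale(alcohol, -1.0, 1.0)
# The smallest value lands on -1 and the largest on 1.
print([round(v, 3) for v in scaled])
```

With scikit-learn, the same target range would be requested by passing `feature_range=(-1, 1)` to `MinMaxScaler`.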

## 2. Standardize data

Standardization rescales each attribute so that it has a mean of 0 and a standard deviation of 1. It is most useful for attributes whose values roughly follow a Gaussian distribution.

In [47]:
# Standardize data (mean= 0 and std.dev= 1)
import numpy
from sklearn.preprocessing import StandardScaler
array = df.values
# separate array into input and output components
X = array[:,1:12]
Y = array[:,12]
# fit once to learn each column's mean and std, then transform
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[-0.168 -0.423  0.284  3.207 -0.315  0.816  0.961  2.1   -1.359 -0.545
  -1.419]
 [-0.707 -0.24   0.146 -0.808 -0.201 -0.93   0.288 -0.232  0.508 -0.276
  -0.832]
 [ 0.68  -0.362  0.559  0.306 -0.173 -0.029 -0.331  0.134  0.259 -0.612
  -0.329]
 [-0.014 -0.666  0.009  0.643  0.055  0.928  1.244  0.301 -0.176 -0.881
  -0.497]
 [-0.014 -0.666  0.009  0.643  0.055  0.928  1.244  0.301 -0.176 -0.881
  -0.497]]




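Here each column has been rescaled to a mean of 0 and a standard deviation of 1. The shape of each distribution is unchanged: standardization shifts and scales the values but does not turn a non-Gaussian column into a Gaussian one.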
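
The per-column calculation behind this is z = (x - mean) / std. The sketch below applies it in plain Python to the pH values of the first four rows; the helper name `standardize` is hypothetical. It uses the population standard deviation (dividing by n), which is the convention `StandardScaler` follows. Since the mean and standard deviation come from four values only, the numbers differ from the full-dataset output.

```python
import statistics

# pH values from the first four rows of df.head()
ph = [3.0, 3.3, 3.26, 3.19]

def standardize(values):
    """Hypothetical helper: apply z = (x - mean) / std to a list."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population std (divides by n)
    return [(v - mean) / std for v in values]

z = standardize(ph)
# After the transform the mean is (numerically) 0 and the std is 1.
print(statistics.mean(z), statistics.pstdev(z))
```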

## 3. Normalization

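The `Normalizer` in scikit-learn rescales each row (sample) to unit norm, by default the Euclidean (L2) norm, so that the squared values in every row sum to 1. This differs from min-max normalization, which rescales each column into a range such as [0, 1], and from standardization, which rescales each column to have a mean of 0 and a standard deviation of 1 (unit variance).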

In [48]:
# Normalization
from sklearn.preprocessing import Normalizer
import numpy
array = df.values
# separate array into input and output components
X = array[:,1:12]
Y = array[:,12]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])

[[3.945e-02 1.522e-03 2.029e-03 1.166e-01 2.536e-04 2.536e-01 9.580e-01
  5.641e-03 1.691e-02 2.536e-03 4.959e-02]
 [4.727e-02 2.251e-03 2.551e-03 1.200e-02 3.676e-04 1.050e-01 9.904e-01
  7.458e-03 2.476e-02 3.676e-03 7.128e-02]
 [7.891e-02 2.728e-03 3.897e-03 6.722e-02 4.871e-04 2.923e-01 9.450e-01
  9.694e-03 3.176e-02 4.287e-03 9.840e-02]
 [3.741e-02 1.195e-03 1.663e-03 4.417e-02 3.014e-04 2.442e-01 9.665e-01
  5.173e-03 1.658e-02 2.078e-03 5.144e-02]
 [3.741e-02 1.195e-03 1.663e-03 4.417e-02 3.014e-04 2.442e-01 9.665e-01
  5.173e-03 1.658e-02 2.078e-03 5.144e-02]]
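
The first row of this output can be reproduced by hand. With its default settings, `Normalizer` divides each row by its Euclidean (L2) length. The plain-Python sketch below does the same for the first row of `df.head()`; the helper name `l2_normalize` is made up for this example.

```python
import math

# Feature values of the first row in df.head() (fixed acidity ... alcohol)
row = [7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8]

def l2_normalize(vector):
    """Hypothetical helper: divide a vector by its Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

unit_row = l2_normalize(row)
# Compare with the first row printed above (3.945e-02, 1.522e-03, ...).
print([f"{v:.3e}" for v in unit_row])
# The rescaled row has length 1.
print(math.sqrt(sum(v * v for v in unit_row)))
```

Total sulfur dioxide (170.0) dominates the row, which is why it ends up close to 1 while the other entries become very small.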


## 4. Binarization

In [49]:
# Binarize data (values above the threshold become 1, the rest 0)
from sklearn.preprocessing import Binarizer
import numpy
array = df.values
# separate array into input and output components
X = array[:,1:12]
Y = array[:,12]
binarizer = Binarizer(threshold=1.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(binaryX[0:5,:])

[[1. 0. 0. 1. 0. 1. 1. 1. 1. 0. 1.]
 [1. 0. 0. 1. 0. 1. 1. 0. 1. 0. 1.]
 [1. 0. 0. 1. 0. 1. 1. 0. 1. 0. 1.]
 [1. 0. 0. 1. 0. 1. 1. 0. 1. 0. 1.]
 [1. 0. 0. 1. 0. 1. 1. 0. 1. 0. 1.]]


Binarization converts every value in the data to 0 or 1 by comparing it with a specified cut-off (threshold). A value greater than the threshold becomes 1, and a value less than or equal to it becomes 0. With `threshold=1.0`, as above, attributes such as volatile acidity and chlorides, which stay below 1, become 0.
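
The thresholding rule is simple enough to write out directly. This plain-Python sketch applies it to the first row of `df.head()` with the same cut-off of 1.0; the helper name `binarize` is hypothetical.

```python
# Feature values of the first row in df.head() (fixed acidity ... alcohol)
row = [7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8]

def binarize(values, threshold):
    """Hypothetical helper: 1.0 if the value exceeds the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]

binary_row = binarize(row, threshold=1.0)
# Matches the first row of the Binarizer output above.
print(binary_row)
```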