# Data Preparation for Machine Learning
* Rescale data
* Standardize data
* Normalize data
* Binarize data

## Data Transforms
1. Fit and Multiple Transform
2. Combined Fit-and-Transform

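**Fit and Multiple Transform** (the preferred approach)
1. Call the **fit()** function once on our data to compute the parameters of the transform.
2. Then call **transform()** on that same data to prepare it for modelling, and again on the test or validation dataset, or on any new data seen in the future.

**Combined Fit-and-Transform**
* **fit_transform()** performs both steps in a single call; it is a convenience for cases where the same data is both fitted and transformed.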
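
The fit-once, transform-many pattern can be sketched as follows. This is a minimal illustration on a small made-up array, not the dataset used later; the names `X_train` and `X_test` and the split itself are assumptions chosen for easy arithmetic.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one feature, values chosen for easy arithmetic
X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[2.5], [12.0]])

# fit() learns the per-column minimum and maximum from the training rows only
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)

# transform() reuses those learned parameters on any dataset
train_scaled = scaler.transform(X_train)   # values 0, 0.5, 1
test_scaled = scaler.transform(X_test)     # values 0.25, 1.2

# fit_transform() is the combined shortcut: fit and transform the same data
combined = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_train)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test rows are scaled with the training minimum and maximum, a value outside the training range can land outside [0, 1], as the second test value does here.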

### 1. Rescale data
* Attributes are often rescaled into the range between 0 and 1; this is commonly referred to as **normalization** (not to be confused with the row-wise normalization in section 3).
* It is useful for optimization algorithms such as gradient descent, for algorithms that weight inputs, such as regression and neural networks, and for algorithms that use **distance measures, such as K-Nearest Neighbors**.

We can rescale our data in scikit-learn using the **MinMaxScaler** class.

```python
# Rescaling data (between 0 and 1)
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler

# load the dataset, supplying the column names explicitly
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

# separating array into input and output components
X = array[:, 0:8]
Y = array[:, 8]

# fit_transform() fits the scaler and transforms X in one call
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# summarizing transformed data
set_printoptions(precision=3)
print(rescaledX[0:5, :])
```

```
[[0.353 0.744 0.59  0.354 0.    0.501 0.234 0.483]
 [0.059 0.427 0.541 0.293 0.    0.396 0.117 0.167]
 [0.471 0.92  0.525 0.    0.    0.347 0.254 0.183]
 [0.059 0.447 0.541 0.232 0.111 0.419 0.038 0.   ]
 [0.    0.688 0.328 0.354 0.199 0.642 0.944 0.2  ]]
```
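
For intuition, the same rescaling can be written out by hand: each column is shifted by its minimum and divided by its range. The sketch below uses a small made-up array rather than the diabetes data, and assumes no column is constant (a constant column would cause a division by zero here).

```python
import numpy as np

# Hypothetical toy data: 3 rows, 2 columns on very different scales
data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0]])

col_min = data.min(axis=0)
col_max = data.max(axis=0)

# x' = (x - min) / (max - min), applied column by column
rescaled = (data - col_min) / (col_max - col_min)
print(rescaled)
```

Both columns end up as 0, 0.5, 1 even though their original scales differ by two orders of magnitude.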


### 2. Standardize data
* Standardization is a useful technique for transforming attributes with a Gaussian distribution and differing means and standard deviations to a **standard Gaussian distribution with a mean of 0 and a standard deviation of 1**.
* It is most suitable for techniques that assume Gaussian-distributed inputs and work better with rescaled data, such as **linear regression**, **logistic regression**, and **linear discriminant analysis**.

We can standardize data using scikit-learn with the **StandardScaler** class.

```python
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler

# X is the input array prepared in the rescaling example
# fit() computes each column's mean and standard deviation; transform() applies them
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarizing transformed data
set_printoptions(precision=3)
print(rescaledX[0:5, :])
```

```
[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
```
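
The standardization itself is the familiar z-score, z = (x - mean) / std, computed per column. A minimal NumPy sketch on made-up values is shown below; it uses the population standard deviation (`ddof=0`), which, as far as the scikit-learn documentation describes, is also what **StandardScaler** uses.

```python
import numpy as np

# Hypothetical toy data: 3 rows, 2 columns with different means and spreads
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 60.0]])

mean = data.mean(axis=0)
std = data.std(axis=0)  # population standard deviation (ddof=0)

# z = (x - mean) / std, applied column by column
standardized = (data - mean) / std

print(standardized.mean(axis=0))  # approximately 0 for each column
print(standardized.std(axis=0))   # approximately 1 for each column
```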


### 3. Normalize data
* Normalizing refers to rescaling each observation (row) to have a length of 1 (a unit norm).
* It can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values, such as neural networks, and algorithms that use distance measures, such as k-Nearest Neighbors.

We can normalize data in Python with scikit-learn using the **Normalizer** class.

```python
# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer

# X is the input array prepared in the rescaling example
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# summarizing transformed data
set_printoptions(precision=3)
print(normalizedX[0:5, :])
```

```
[[0.034 0.828 0.403 0.196 0.    0.188 0.004 0.28 ]
 [0.008 0.716 0.556 0.244 0.    0.224 0.003 0.261]
 [0.04  0.924 0.323 0.    0.    0.118 0.003 0.162]
 [0.007 0.588 0.436 0.152 0.622 0.186 0.001 0.139]
 [0.    0.596 0.174 0.152 0.731 0.188 0.01  0.144]]
```
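
To see what "length of 1" means, the row-wise operation can be reproduced by hand: each row is divided by its own Euclidean (L2) length, which is the default norm of **Normalizer**. The array below is made up for illustration, and the sketch assumes no row is all zeros.

```python
import numpy as np

# Hypothetical toy data: each row is one observation
rows = np.array([[3.0, 4.0],
                 [1.0, 0.0],
                 [6.0, 8.0]])

# Euclidean (L2) length of each row, kept as a column so it broadcasts
lengths = np.linalg.norm(rows, axis=1, keepdims=True)

# Divide every row by its own length
unit_rows = rows / lengths
print(unit_rows)
```

The rows `[3, 4]` and `[6, 8]` both become `[0.6, 0.8]`: only the direction of each observation is kept, not its magnitude. Unlike the column-wise scalers, nothing is learned from the dataset as a whole, because each row is handled independently.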


### 4. Binarize data
* We can transform our data using a **binary threshold**.
* All values above the **threshold** are marked **1** and all values equal to or below it are marked **0**.
* It can be useful when we have probabilities that we want to turn into crisp values.

We can create binary attributes in Python using scikit-learn with the **Binarizer** class.


```python
# Binarization
from sklearn.preprocessing import Binarizer

# all values equal to or less than 0 are marked 0 and all of those above 0 are marked 1
# X is the input array prepared in the rescaling example
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)

# summarizing transformed data
set_printoptions(precision=3)
print(binaryX[0:5, :])
```

```
[[1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1. 1. 1.]]
```
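
The probability use case mentioned above can be sketched in the same way. The probabilities and the 0.5 cut-off below are made up for illustration; the point is that a value exactly equal to the threshold is marked 0, and only values strictly above it are marked 1.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted probabilities, one row per sample
probs = np.array([[0.20, 0.50, 0.90],
                  [0.51, 0.49, 0.05]])

# Values strictly above 0.5 become 1; values equal to or below 0.5 become 0
crisp = Binarizer(threshold=0.5).fit_transform(probs)
print(crisp)
```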


## Summary
* We learnt how to pre-process data by rescaling, standardizing, normalizing, and binarizing it, depending on our requirements.