## Data Preprocessing Template in Python
## 1. Rescale Data
When your data contains attributes with varying scales, many machine learning algorithms benefit from rescaling all attributes to the same scale.

This is often referred to as normalization, and attributes are typically rescaled into the range between 0 and 1. Rescaling is useful for optimization algorithms at the core of many machine learning methods, such as gradient descent. It is also useful for algorithms that weight inputs, such as regression and neural networks, and for algorithms that use distance measures, such as K-Nearest Neighbors.

You can rescale your data in scikit-learn with the MinMaxScaler class.

```python
# Rescale data (between 0 and 1)
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)
array = dataframe.values
# Separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]
# Learn each column's min and max, then map every column onto [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# Summarize transformed data
np.set_printoptions(precision=3)
print(rescaledX[0:5, :])
```

```
[[ 0.353  0.744  0.59   0.354  0.     0.501  0.234  0.483]
 [ 0.059  0.427  0.541  0.293  0.     0.396  0.117  0.167]
 [ 0.471  0.92   0.525  0.     0.     0.347  0.254  0.183]
 [ 0.059  0.447  0.541  0.232  0.111  0.419  0.038  0.   ]
 [ 0.     0.688  0.328  0.354  0.199  0.642  0.944  0.2  ]]
```
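
To make the rescaling concrete, here is a minimal sketch that applies the min-max formula, (x - min) / (max - min), to each column of a small made-up array using only NumPy. The names `min_max_rescale` and `toy` are hypothetical and exist only for this illustration, and the sketch assumes no column is constant (a constant column would make the denominator zero).

```python
import numpy as np

def min_max_rescale(X):
    """Rescale each column of X to the range [0, 1].

    Hypothetical helper for illustration; assumes no column is constant.
    """
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)  # smallest value in each column
    col_max = X.max(axis=0)  # largest value in each column
    return (X - col_min) / (col_max - col_min)

# Made-up data: two attributes on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

rescaled_toy = min_max_rescale(toy)
# After rescaling, every column runs from 0 to 1
print(rescaled_toy)
```

Both columns end up on the same scale even though the second one started with values hundreds of times larger than the first.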


## 2. Standardize Data
Standardization transforms attributes that follow a Gaussian distribution with differing means and standard deviations into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.

It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression, and linear discriminant analysis.

You can standardize data using scikit-learn with the StandardScaler class.

```python
# Standardize data (0 mean, 1 stdev)
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)
array = dataframe.values
# Separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]
# Learn each column's mean and standard deviation, then apply them
scaler = StandardScaler().fit(X)
standardizedX = scaler.transform(X)
# Summarize transformed data
np.set_printoptions(precision=3)
print(standardizedX[0:5, :])
```

```
[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
```
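
The arithmetic behind standardization can be sketched by hand: subtract each column's mean and divide by its standard deviation. The example below uses only NumPy on a small made-up array; `standardize` and `toy` are hypothetical names for this illustration. It uses the population standard deviation (NumPy's default, `ddof=0`), which should match what StandardScaler computes by default, and it assumes no column is constant.

```python
import numpy as np

def standardize(X):
    """Shift each column to mean 0 and scale it to standard deviation 1.

    Hypothetical helper for illustration; assumes no column is constant.
    """
    X = np.asarray(X, dtype=float)
    col_mean = X.mean(axis=0)
    col_std = X.std(axis=0)  # population standard deviation (ddof=0)
    return (X - col_mean) / col_std

# Made-up data: two attributes with different means and spreads
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])

standardized_toy = standardize(toy)
print(standardized_toy)
# Each column should now have mean 0 and standard deviation 1
# (up to floating-point rounding)
print(standardized_toy.mean(axis=0))
print(standardized_toy.std(axis=0))
```

Because the two made-up columns differ only by a constant factor, they become identical after standardization.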


## 3. Normalize Data
Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).

This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of varying scales, both for algorithms that weight input values, such as neural networks, and for algorithms that use distance measures, such as K-Nearest Neighbors.

You can normalize data in Python with scikit-learn using the Normalizer class.

```python
# Normalize data (length of 1)
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)
array = dataframe.values
# Separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]
# Rescale each row (not each column) to unit length
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# Summarize transformed data
np.set_printoptions(precision=3)
print(normalizedX[0:5, :])
```

```
[[ 0.034  0.828  0.403  0.196  0.     0.188  0.004  0.28 ]
 [ 0.008  0.716  0.556  0.244  0.     0.224  0.003  0.261]
 [ 0.04   0.924  0.323  0.     0.     0.118  0.003  0.162]
 [ 0.007  0.588  0.436  0.152  0.622  0.186  0.001  0.139]
 [ 0.     0.596  0.174  0.152  0.731  0.188  0.01   0.144]]
```
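
Unlike the two previous techniques, normalization works row by row. The following minimal sketch divides each row of a small made-up array by its Euclidean (L2) length using only NumPy. The names `normalize_rows` and `toy` are hypothetical, and leaving all-zero rows unchanged is a choice made for this sketch so that it avoids dividing by zero.

```python
import numpy as np

def normalize_rows(X):
    """Rescale each row of X to unit Euclidean (L2) length.

    Hypothetical helper for illustration; all-zero rows are left unchanged.
    """
    X = np.asarray(X, dtype=float)
    row_norms = np.linalg.norm(X, axis=1, keepdims=True)
    row_norms[row_norms == 0] = 1.0  # avoid dividing by zero
    return X / row_norms

# Made-up data: the first row has length 5, the second length 5,
# and the third is all zeros
toy = np.array([[3.0, 4.0],
                [0.0, 5.0],
                [0.0, 0.0]])

normalized_toy = normalize_rows(toy)
print(normalized_toy)
# Length of each row after normalization
print(np.linalg.norm(normalized_toy, axis=1))
```

The first row becomes (0.6, 0.8), which has length 1 because 0.6² + 0.8² = 1.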


## 4. Binarize Data (Make Binary)
You can transform your data using a binary threshold. All values above the threshold are marked 1, and all values equal to or below it are marked 0.

This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering, when you want to add new features that indicate something meaningful.

You can create new binary attributes in Python using scikit-learn with the Binarizer class.

```python
# Binarization
import numpy as np
import pandas as pd
from sklearn.preprocessing import Binarizer

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)
array = dataframe.values
# Separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]
# Mark values above 0.0 as 1 and all other values as 0
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# Summarize transformed data
np.set_printoptions(precision=3)
print(binaryX[0:5, :])
```

```
[[ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  0.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.]
 [ 0.  1.  1.  1.  1.  1.  1.  1.]]
```
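
The thresholding rule is simple enough to sketch directly with NumPy. The example below turns a small made-up array of probabilities into crisp 0/1 values using a threshold of 0.5. The names `binarize`, `probabilities`, and `crisp` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def binarize(X, threshold=0.0):
    """Mark values strictly above the threshold as 1.0 and all others as 0.0.

    Hypothetical helper for illustration.
    """
    X = np.asarray(X, dtype=float)
    return (X > threshold).astype(float)

# Made-up predicted probabilities
probabilities = np.array([[0.2, 0.5, 0.9],
                          [0.7, 0.0, 1.0]])

crisp = binarize(probabilities, threshold=0.5)
# A value exactly equal to the threshold (0.5 here) is marked 0
print(crisp)
```

The comparison is strict, so a value exactly equal to the threshold falls into the 0 group, in line with the rule described at the start of this section.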
