# Prepare Your Data For Machine Learning

- Many machine learning algorithms make assumptions about your data. 
- It is often a good idea to prepare your data in such a way that it best exposes the structure of the problem to the machine learning algorithms you intend to use.

- In this notebook you will discover how to prepare your data for machine learning in Python using scikit-learn. 
- After completing this lesson you will know how to:

   1. Rescale data.
   2. Standardize data.
   3. Normalize data.
   4. Binarize data.

## Read file using pandas

In [2]:
# Load the dataset with pandas
from pandas import read_csv

filename = 'data/05/data.csv'
data_cancer = read_csv(filename)

## Extract data from the DataFrame

In [3]:
# Select the six feature columns at positions 2 through 7 (the end of the slice is exclusive)
data_extracted = data_cancer.iloc[:, 2:8]
data_extracted

| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean |
|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 |
| ... | ... | ... | ... | ... | ... | ... |
| 564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 |
| 565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 |
| 566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 |
| 567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 |


In [34]:
array = data_extracted.values

# Every extracted column is an input feature; this slice contains no
# target column, so there is no separate output array to split off.
X = array

## Rescale Data

- When your data consists of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes so that they all share the same scale.
- This is often referred to as normalization, and attributes are usually rescaled into the range 0 to 1.


In [35]:
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[0.521 0.023 0.546 0.364 0.594 0.792]
 [0.643 0.273 0.616 0.502 0.29  0.182]
 [0.601 0.39  0.596 0.449 0.514 0.431]
 [0.21  0.361 0.234 0.103 0.811 0.811]
 [0.63  0.157 0.631 0.489 0.43  0.348]]
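
The rescaling above can be reproduced by hand to see what happens to each column. The following is a minimal sketch, assuming only NumPy and the five rows shown in the table earlier (the name `sample` is introduced here for illustration). Each column has its minimum subtracted and is then divided by its range.

```python
import numpy as np

# First five rows of the six extracted columns, copied from the table above.
sample = np.array([
    [17.99, 10.38, 122.80, 1001.0, 0.11840, 0.27760],
    [20.57, 17.77, 132.90, 1326.0, 0.08474, 0.07864],
    [19.69, 21.25, 130.00, 1203.0, 0.10960, 0.15990],
    [11.42, 20.38, 77.58, 386.1, 0.14250, 0.28390],
    [20.29, 14.34, 135.10, 1297.0, 0.10030, 0.13280],
])

# Column-wise minimum and maximum.
col_min = sample.min(axis=0)
col_max = sample.max(axis=0)

# Min-max rescaling: every column ends up in the range [0, 1].
rescaled = (sample - col_min) / (col_max - col_min)

np.set_printoptions(precision=3)
print(rescaled)
```

The numbers differ from the output above because the minimum and maximum here are taken over five rows only, not over the full dataset.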


## Standardize Data

- Standardization transforms attributes with differing means and standard deviations so that each one has a mean of 0 and a standard deviation of 1. It is most useful when the attributes are roughly Gaussian, in which case the result is a standard Gaussian distribution.

In [36]:
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler


scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[ 1.097 -2.073  1.27   0.984  1.568  3.284]
 [ 1.83  -0.354  1.686  1.909 -0.827 -0.487]
 [ 1.58   0.456  1.567  1.559  0.942  1.053]
 [-0.769  0.254 -0.593 -0.764  3.284  3.403]
 [ 1.75  -1.152  1.777  1.826  0.28   0.539]]
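
Standardization can also be written out directly. The sketch below assumes only NumPy and reuses the five rows from the table earlier (the name `sample` is illustrative). It subtracts each column's mean and divides by the population standard deviation (`ddof=0`), which is the convention `StandardScaler` uses.

```python
import numpy as np

# First five rows of the six extracted columns, copied from the table above.
sample = np.array([
    [17.99, 10.38, 122.80, 1001.0, 0.11840, 0.27760],
    [20.57, 17.77, 132.90, 1326.0, 0.08474, 0.07864],
    [19.69, 21.25, 130.00, 1203.0, 0.10960, 0.15990],
    [11.42, 20.38, 77.58, 386.1, 0.14250, 0.28390],
    [20.29, 14.34, 135.10, 1297.0, 0.10030, 0.13280],
])

# Column-wise mean and population standard deviation (NumPy's default ddof=0).
col_mean = sample.mean(axis=0)
col_std = sample.std(axis=0)

# Standardize: each column now has mean 0 and standard deviation 1.
standardized = (sample - col_mean) / col_std

np.set_printoptions(precision=3, suppress=True)
print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```

As with rescaling, the statistics here come from five rows only, so the standardized values will not match the output above.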


## Normalize Data

- Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1, known in linear algebra as a unit norm.
- This pre-processing method can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values, such as neural networks, and algorithms that use distance measures, such as k-Nearest Neighbors.
- You can normalize data in Python with scikit-learn using the Normalizer class.

In [37]:
from sklearn.preprocessing import Normalizer


scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(normalizedX[0:5,:])

[[1.783e-02 1.029e-02 1.217e-01 9.923e-01 1.174e-04 2.752e-04]
 [1.543e-02 1.333e-02 9.971e-02 9.948e-01 6.357e-05 5.900e-05]
 [1.627e-02 1.756e-02 1.074e-01 9.939e-01 9.055e-05 1.321e-04]
 [2.895e-02 5.166e-02 1.966e-01 9.787e-01 3.612e-04 7.196e-04]
 [1.556e-02 1.099e-02 1.036e-01 9.944e-01 7.690e-05 1.018e-04]]
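
Because normalization works row by row, it can be checked on a few rows in isolation. The sketch below assumes only NumPy and the five rows from the table earlier (the name `sample` is illustrative). It divides each row by its Euclidean (L2) length, which is the default norm used by `Normalizer`.

```python
import numpy as np

# First five rows of the six extracted columns, copied from the table above.
sample = np.array([
    [17.99, 10.38, 122.80, 1001.0, 0.11840, 0.27760],
    [20.57, 17.77, 132.90, 1326.0, 0.08474, 0.07864],
    [19.69, 21.25, 130.00, 1203.0, 0.10960, 0.15990],
    [11.42, 20.38, 77.58, 386.1, 0.14250, 0.28390],
    [20.29, 14.34, 135.10, 1297.0, 0.10030, 0.13280],
])

# Euclidean length of each row, kept as a column so it broadcasts per row.
row_norms = np.linalg.norm(sample, axis=1, keepdims=True)

# Divide each row by its own length so every row has unit norm.
normalized = sample / row_norms

np.set_printoptions(precision=3)
print(normalized)
print(np.linalg.norm(normalized, axis=1))
```

Each row depends only on its own values, so these results should agree with the corresponding rows of the output above. The `area_mean` column dominates every row because its raw values are far larger than the others.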


## Binarize Data (Make Binary)

- You can transform your data using a binary threshold. 
- All values above the threshold are marked 1, and all values equal to or below it are marked 0.
- This is called binarizing or thresholding your data.

In [38]:
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(binaryX[0:5,:])

[[1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]]
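
Every value in the output above is 1 because all of the extracted measurements are positive, so a threshold of 0.0 separates nothing. The sketch below assumes only NumPy and the five rows from the table earlier. It applies the same rule by hand and then, purely as an illustration, uses each column's mean as a hypothetical threshold to produce a more informative split.

```python
import numpy as np

# First five rows of the six extracted columns, copied from the table above.
sample = np.array([
    [17.99, 10.38, 122.80, 1001.0, 0.11840, 0.27760],
    [20.57, 17.77, 132.90, 1326.0, 0.08474, 0.07864],
    [19.69, 21.25, 130.00, 1203.0, 0.10960, 0.15990],
    [11.42, 20.38, 77.58, 386.1, 0.14250, 0.28390],
    [20.29, 14.34, 135.10, 1297.0, 0.10030, 0.13280],
])

# Same rule as a threshold of 0.0: values strictly above the threshold become 1.
binary_zero = (sample > 0.0).astype(float)

# Hypothetical alternative: threshold each column at its own mean.
col_mean = sample.mean(axis=0)
above_mean = (sample > col_mean).astype(float)

print(binary_zero)
print(above_mean)
```

The per-column mean is only one possible choice. In practice the threshold should come from what the binary feature is meant to represent.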
