## Problem Statement
- Load data set 
- Normalize data
- Standardize data

#### Need for Pre-processing
- Different algorithms make different assumptions about your data and may require different transforms.
- On the other hand, some algorithms can deliver better results without pre-processing.

- The general idea is to iterate over a set of data transforms and algorithms, check the performance of each combination, and select the most appropriate transform and algorithm.
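
The idea of trying several transform and algorithm combinations can be sketched as a small loop. This is only an illustration under assumptions: the data is synthetic (from `make_classification`, not the diabetes dataset), and the two models, the 5-fold cross-validation, and the names `X_demo`, `y_demo`, `transforms`, `models`, and `results` are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (assumption: not the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate transforms; None means "no pre-processing"
transforms = {"none": None, "min-max": MinMaxScaler(), "standard": StandardScaler()}

# Candidate algorithms, built fresh for every combination
models = {
    "logreg": lambda: LogisticRegression(max_iter=1000),
    "knn": lambda: KNeighborsClassifier(),
}

results = {}
for t_name, transform in transforms.items():
    for m_name, make_model in models.items():
        model = make_model()
        # Put the transform inside a pipeline so it is fitted on each training fold only
        estimator = model if transform is None else make_pipeline(transform, model)
        scores = cross_val_score(estimator, X_demo, y_demo, cv=5)
        results[(t_name, m_name)] = scores.mean()

# Print the combinations from best to worst mean accuracy
for (t_name, m_name), acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{t_name:>8} + {m_name:<6} accuracy = {acc:.3f}")
```

Wrapping the transform and the model in one pipeline keeps the scaler from seeing the held-out fold, which would otherwise make the scores look better than they are.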

#### Load Python libraries and dataset

In [None]:
import pandas as pd
from matplotlib import pyplot 

In [None]:
data = pd.read_csv("../data/pima-indians-diabetes.csv")

#### Check Your Data

In [None]:
# check first 5 rows of the dataset
print(data.head(5))

## <span style="color:red">Normalize Data</span>

- Data attributes may have varying scales. 
- In this situation, many ML algorithms may not perform well.  
- To get better results, we have to rescale the data.
- This is referred to as normalization, and attributes are often rescaled into the range between 0 and 1.
- Rescaling can be implemented using the scikit-learn **MinMaxScaler** class.
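
As a minimal sketch of what this rescaling does, each value can be mapped with `(x - min) / (max - min)`, computed per column, and compared with the output of `MinMaxScaler`. The small array `X_demo` and the names `manual` and `sklearn_result` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny example with two attributes on very different scales (hypothetical values)
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [4.0, 1000.0]])

# Manual min-max normalization, column by column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)

# Same operation with scikit-learn (default feature_range is (0, 1))
sklearn_result = MinMaxScaler().fit_transform(X_demo)

print(manual)
print(np.allclose(manual, sklearn_result))
```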

In [None]:
# Import Python library MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
from numpy import set_printoptions
set_printoptions(precision=3)

In [None]:
# separate array into input and output components
data_array = data.values
X = data_array[:,0:8]
y = data_array[:,8]

### Descriptive statistics of the data before normalization

In [None]:
# Check Statistical properties of each attribute
X_data = pd.DataFrame(data=X)
print(X_data.describe())


#### Normalize Data using MinMaxScaler class

In [None]:
# Normalize data
# instantiate MinMaxScaler class
scaler = MinMaxScaler(feature_range=(0, 1))
normalizedX = scaler.fit_transform(X)
# summarize transformed data
set_printoptions(precision=3) # set output precision to three decimal places
# print first 5 rows
print(normalizedX[0:5,:])

### Descriptive statistics of the data after normalization

In [None]:
X_normalized_data = pd.DataFrame(data=normalizedX)
print(X_normalized_data.describe())

#### The output above shows that after rescaling, all of the values are in the range between 0 and 1.
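
The range can also be checked directly instead of being read off the `describe()` table. The sketch below assumes synthetic data (`X_demo`, drawn from uniform distributions with arbitrary bounds) and uses illustrative names `normalized_demo` and `ranges`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic attributes with different ranges (assumption: stand-in for the real data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(low=[0.0, 50.0, -5.0], high=[17.0, 200.0, 5.0], size=(100, 3))

normalized_demo = pd.DataFrame(MinMaxScaler().fit_transform(X_demo))

# One row for the column minimums, one for the column maximums
ranges = normalized_demo.agg(["min", "max"])
print(ranges)
```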

## <span style="color:red"> Standardize Data</span>

- Attributes with Gaussian distributions may have differing means and standard deviations.
- **Standardization** is a useful technique to transform attributes to a mean of **0** and a standard deviation of **1**.
- It shifts and rescales each attribute; an attribute that is Gaussian becomes a standard Gaussian, but the shape of the distribution itself is not changed.
- It is most suitable for techniques that assume a Gaussian distribution in the input variables, such as
    - linear regression,
    - logistic regression.
- Standardization can be implemented in scikit-learn with the **StandardScaler** class.
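
A minimal sketch of the computation: subtract each column's mean and divide by its standard deviation, then compare with `StandardScaler`. The array `X_demo` and the names `manual` and `sklearn_result` are illustrative. Note that `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny example with two attributes on very different scales (hypothetical values)
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [4.0, 1000.0]])

# Manual standardization, column by column
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)  # population standard deviation (ddof=0)
manual = (X_demo - col_mean) / col_std

# Same operation with scikit-learn
sklearn_result = StandardScaler().fit_transform(X_demo)

print(manual)
print(np.allclose(manual, sklearn_result))
```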

### Density distribution of data before Standardization

In [None]:
# Plot the density of each attribute
X_data = pd.DataFrame(data=X)
X_data.plot(kind='density', subplots=True, layout=(3,3), sharex=False, figsize=(15,15))
pyplot.show()


#### Standardize Data using StandardScaler class

In [None]:
# Import Python Library StandardScaler
from sklearn.preprocessing import StandardScaler

In [None]:
# instantiate StandardScaler class and fit on data 
scaler = StandardScaler().fit(X)
standardizedX = scaler.transform(X)
# summarize transformed data
set_printoptions(precision=3)  # set output precision to three decimal places
# print first 5 rows
print(standardizedX[0:5,:])

### Density distribution of data after Standardization

In [None]:
# Plot the density of each attribute
X_standardized_data = pd.DataFrame(data=standardizedX)
X_standardized_data.plot(kind='density', subplots=True, layout=(3,3), sharex=False, figsize=(15,15))
pyplot.show()

#### The output above shows that after standardization, the mean is 0 and the standard deviation is 1.
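
Density plots show the shape of each attribute, so a numeric check can help confirm the mean and standard deviation. The sketch below assumes synthetic Gaussian data (`X_demo`) and an illustrative name `standardized`. One detail worth knowing: pandas `std()` (and therefore `describe()`) uses the sample standard deviation (`ddof=1`) by default, so it reports a value slightly above 1, while `std(ddof=0)` matches what `StandardScaler` uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic Gaussian attributes with different means and spreads (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[50.0, 3.0], scale=[10.0, 0.5], size=(200, 2))

standardized = pd.DataFrame(StandardScaler().fit_transform(X_demo))

print(standardized.mean().round(3))       # per-column mean, approximately 0
print(standardized.std(ddof=0).round(3))  # population std, as used by StandardScaler
print(standardized.std().round(3))        # pandas default (ddof=1), slightly above 1
```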

### Summary
- Normalization - ? 
- Standardization - ? 