<img style="width:450px;" src="https://durhamcollege.ca/wp-content/uploads/ai-hub-header.jpg" alt="DC Logo"/>

# LESSON 5 - Pre-Processing Data
## <span style="color: green">OVERVIEW</span>

Many machine learning algorithms make assumptions about your data.

It is often a very good idea to prepare your data in such a way as to best expose the structure of the problem to the machine learning algorithms that you intend to use.

In this post you will discover how to prepare your data for machine learning in Python using the scikit-learn library, which contains tools for data mining and analysis built on NumPy, SciPy, and matplotlib.


### The Need for Data Processing

You almost always need to pre-process your data. It is a required step.

A common difficulty is that different algorithms make different assumptions about your data and may require different transforms. Further, even when you follow all of the rules and prepare your data, some algorithms can still deliver better results without the pre-processing.

Generally, I would recommend creating many different views and transforms of your data, then exercising a handful of algorithms on each view of your dataset. This will help you flush out which data transforms are better at exposing the structure of your problem in general.
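The idea of trying several views against a handful of algorithms can be sketched roughly as follows. This is only an illustration: the synthetic data from `make_classification`, the two chosen algorithms, and the 5-fold cross-validation are all assumptions, so swap in your own data and models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Synthetic stand-in data (an assumption; replace with your own X and Y).
X, Y = make_classification(n_samples=300, n_features=8, random_state=7)
# Give the columns very different scales, as real data often has.
X = X * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])

# Candidate "views" of the data; None means no pre-processing at all.
views = {
    "raw": None,
    "rescaled": MinMaxScaler(),
    "standardized": StandardScaler(),
    "normalized": Normalizer(),
}
# A handful of algorithms to exercise on each view.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
}

results = {}
for view_name, transform in views.items():
    for model_name, model in models.items():
        # A pipeline refits the transform inside every cross-validation fold.
        estimator = model if transform is None else make_pipeline(transform, model)
        scores = cross_val_score(estimator, X, Y, cv=5)
        results[(view_name, model_name)] = scores.mean()
        print(f"{view_name:>12} + {model_name:<20} accuracy = {scores.mean():.3f}")
```

Comparing the printed accuracies side by side gives a first impression of which transform suits which algorithm on a given dataset.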

### Pre-Processing Machine Learning Recipes

**This section lists 4 different data preprocessing recipes for machine learning.**

All of the recipes were designed to be complete and standalone.

You can copy and paste them directly into your project and start working.

**Note:** 

- <span style="color: blue">**Answer any questions in bold and blue in the code block (cell) below each section.** </span>
  
- <span style="color: green">*Any statements in italic and green are for consideration and should help guide you to understand the code involved.* </span>

**The Pima Indian diabetes dataset is used in each recipe.**

This is a binary classification problem where all of the attributes are numeric and have different scales. It is a great example of a dataset that can benefit from pre-processing.

**You can learn more about this data set on the UCI Machine Learning Repository webpage.**
- (https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes)

**Each recipe follows the same structure:**

1. Load the dataset from a URL.
2. Split the dataset into the input and output variables for machine learning.
3. Apply a preprocessing transform to the input variables.
4. Summarize the data to show the change.
5. The transforms are calculated in such a way that they can be applied to your training data and any samples of data you may have in the future.
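Step 5 is worth a small illustration: a transform is fitted once on the training data, and the fitted object is then reused on later samples. The sketch below assumes synthetic data and uses `train_test_split` only to set aside some rows that play the role of "future" samples; neither is part of the recipes themselves.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the input variables (an assumption, not the Pima data).
rng = np.random.default_rng(7)
X = rng.normal(loc=[3.0, 120.0, 70.0], scale=[3.0, 30.0, 12.0], size=(100, 3))

# Hold back some rows to play the role of "future" samples.
X_train, X_new = train_test_split(X, test_size=0.2, random_state=7)

# fit() learns the per-column minimum and maximum from the training rows only.
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)

# transform() reuses those stored statistics on any later data.
rescaled_train = scaler.transform(X_train)
rescaled_new = scaler.transform(X_new)

np.set_printoptions(precision=3)
print(rescaled_train[:3])
print(rescaled_new[:3])
```

Because the minimum and maximum come from the training rows, values in the new rows can fall slightly outside the 0 to 1 range when they exceed what was seen during fitting.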

**The *scikit-learn* documentation has information on how to use the various preprocessing methods.**
- (http://scikit-learn.org/stable/modules/preprocessing.html) 

**You can review the preprocessing API in scikit-learn here.**
- (http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)

## <span style="color: green">SECTION 1</span>
### Re-scale Data

When your data consists of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes so that they all have the same scale.

This is often referred to as normalization, and attributes are commonly rescaled into the range between 0 and 1. Rescaling is useful for optimization algorithms, such as gradient descent, that sit at the core of many machine learning algorithms. It is also useful for algorithms that weight inputs, such as regression and neural networks, and for algorithms that use distance measures, such as K-Nearest Neighbors.
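To see why distance-based algorithms care about scale, consider the small sketch below. The three samples are made-up values chosen for illustration (they are not from the Pima dataset), and `distance` is a hypothetical helper that computes a plain Euclidean distance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Three made-up samples with two attributes on very different scales.
X = np.array([
    [0.2, 1000.0],
    [0.9, 1010.0],
    [0.2, 1100.0],
])

def distance(a, b):
    """Euclidean distance between two rows (illustrative helper)."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Before rescaling, the second attribute decides almost everything.
raw_01 = distance(X[0], X[1])
raw_02 = distance(X[0], X[2])

# After rescaling, both attributes contribute on the same 0-1 scale.
R = MinMaxScaler().fit_transform(X)
scaled_01 = distance(R[0], R[1])
scaled_02 = distance(R[0], R[2])

print(f"raw:      d(0,1) = {raw_01:.3f}, d(0,2) = {raw_02:.3f}")
print(f"rescaled: d(0,1) = {scaled_01:.3f}, d(0,2) = {scaled_02:.3f}")
```

On the raw values, sample 1 looks far closer to sample 0 than sample 2 does, purely because the second attribute has larger numbers. After rescaling, the large difference in the first attribute counts as well, and the two distances come out nearly equal.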

**You can rescale your data in scikit-learn using the *MinMaxScaler* class.**
- (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)

In [2]:
import pandas
import numpy

# Rescale data (between 0 and 1)
from sklearn.preprocessing import MinMaxScaler

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

dataframe = pandas.read_csv(url, names=names)

array = dataframe.values

# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]

scaler = MinMaxScaler(feature_range=(0, 1))

rescaledX = scaler.fit_transform(X)

# display a summary of the Rescaled data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[ 0.353  0.744  0.59   0.354  0.     0.501  0.234  0.483]
 [ 0.059  0.427  0.541  0.293  0.     0.396  0.117  0.167]
 [ 0.471  0.92   0.525  0.     0.     0.347  0.254  0.183]
 [ 0.059  0.447  0.541  0.232  0.111  0.419  0.038  0.   ]
 [ 0.     0.688  0.328  0.354  0.199  0.642  0.944  0.2  ]]


***After rescaling you can see that all of the values are in the range between 0 and 1.***
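For the default range of 0 to 1, the rescaling above corresponds to computing `(x - min) / (max - min)` separately for each column. A minimal check of that, assuming a small made-up matrix rather than the Pima data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small made-up matrix: rows are samples, columns are attributes.
X = np.array([
    [2.0, 110.0, 60.0],
    [0.0,  90.0, 80.0],
    [5.0, 150.0, 70.0],
    [1.0, 130.0, 65.0],
])

# Per-column minimum and maximum, then the min-max formula by hand.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

# The same transform using scikit-learn.
rescaledX = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

np.set_printoptions(precision=3)
print(manual)
print(np.allclose(manual, rescaledX))
```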

## <span style="color: green">SECTION 2</span>
### Standardize Data

Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.
- (https://en.wikipedia.org/wiki/Normal_distribution)

It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminant analysis.


**You can standardize data using *scikit-learn* with the *StandardScaler* class.**
- (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [4]:
import pandas
import numpy

# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

# Separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# display a summary of the Standardized data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]


***The values for each attribute now have a mean value of 0 and a standard deviation of 1.***
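Standardizing a column amounts to subtracting its mean and dividing by its standard deviation. The sketch below checks this by hand on a small made-up matrix (an assumption, not the Pima data). Note that *StandardScaler* uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up matrix: rows are samples, columns are attributes.
X = np.array([
    [2.0, 110.0, 60.0],
    [0.0,  90.0, 80.0],
    [5.0, 150.0, 70.0],
    [1.0, 130.0, 65.0],
])

# Standardize by hand: subtract the column mean, divide by the column stdev.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transform using scikit-learn.
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

np.set_printoptions(precision=3, suppress=True)
print(rescaledX.mean(axis=0))  # each column mean is (numerically) 0
print(rescaledX.std(axis=0))   # each column standard deviation is 1
print(np.allclose(manual, rescaledX))
```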

## <span style="color: green">SECTION 3</span>
### Normalize Data

Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).

This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as K-Nearest Neighbors.

**You can normalize data in Python with *scikit-learn* using the *Normalizer* class.**
- (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html)

In [7]:
import pandas
import numpy

# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

dataframe = pandas.read_csv(url, names=names)

array = dataframe.values

# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]

# create a new scaler for the transformation
# using the sklearn Normalizer class
xscaler = Normalizer().fit(X)

normalizedX = xscaler.transform(X)

# set the print values to 3 decimal places
numpy.set_printoptions(precision=3)
# display a summary of the Normalized data
print(normalizedX[0:5,:])

[[ 0.034  0.828  0.403  0.196  0.     0.188  0.004  0.28 ]
 [ 0.008  0.716  0.556  0.244  0.     0.224  0.003  0.261]
 [ 0.04   0.924  0.323  0.     0.     0.118  0.003  0.162]
 [ 0.007  0.588  0.436  0.152  0.622  0.186  0.001  0.139]
 [ 0.     0.596  0.174  0.152  0.731  0.188  0.01   0.144]]


***The rows are normalized to length 1.***
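Unlike the two previous recipes, this transform works row by row: each row is divided by its own Euclidean (L2) length, which is the default norm of the *Normalizer* class. A minimal check, assuming a small made-up matrix with no all-zero rows:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Small made-up matrix: rows are samples, columns are attributes.
X = np.array([
    [3.0, 4.0, 0.0],
    [1.0, 2.0, 2.0],
    [0.0, 5.0, 12.0],
])

# Length of each row, kept as a column so it divides row-wise.
row_lengths = np.linalg.norm(X, axis=1, keepdims=True)
manual = X / row_lengths

# The same transform using scikit-learn (L2 norm is the default).
normalizedX = Normalizer(norm="l2").fit_transform(X)

np.set_printoptions(precision=3)
print(normalizedX)
print(np.linalg.norm(normalizedX, axis=1))  # every row now has length 1
```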

## <span style="color: green">SECTION 4</span>
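### Binarize Data - Make it Binary
***also known as thresholding (not the same as One-Hot Encoding, which expands a categorical attribute into one binary column per category)***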

You can transform your data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.

This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering, when you want to add new features that indicate something meaningful.

**You can create new binary attributes in Python using *scikit-learn* with the *Binarizer* class.**
- (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html)

In [7]:
import pandas
import numpy

# Binarize data (0 at or below the threshold, 1 above it)
from sklearn.preprocessing import Binarizer

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

dataframe = pandas.read_csv(url, names=names)

array = dataframe.values

# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]

binarizer = Binarizer(threshold=0.0).fit(X)

binaryX = binarizer.transform(X)

# display a summary of the Binarized data
numpy.set_printoptions(precision=3)
print(binaryX[0:5,:])

[[ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  0.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.]
 [ 0.  1.  1.  1.  1.  1.  1.  1.]]


***You can see that all values equal to or less than 0 are marked 0 and all of those above 0 are marked 1.***
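The same class can turn probabilities into crisp 0/1 values. The sketch below assumes some made-up probabilities and a threshold of 0.5, both chosen purely for illustration. Values exactly equal to the threshold are marked 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up predicted probabilities (illustrative values only).
probabilities = np.array([
    [0.10, 0.50, 0.51],
    [0.90, 0.49, 0.75],
])

# Values strictly above the threshold become 1, everything else becomes 0.
binarizer = Binarizer(threshold=0.5)
crisp = binarizer.fit_transform(probabilities)

# The equivalent comparison written directly in NumPy.
manual = (probabilities > 0.5).astype(float)

print(crisp)
print(np.array_equal(crisp, manual))
```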

## <span style="color: green">SUMMARY</span>

In this post you discovered how you can prepare your data for machine learning in Python using ***scikit-learn***.

**You now have recipes to:**

- **Re-scale** data.
- **Standardize** data.
- **Normalize** data.
- **Binarize** data.

<center>**Now it is your turn to practice data pre-processing in *scikit-learn*!**</center>
 

## <span style="color: green">CHALLENGE</span>

<span style="color: blue">**For each of the 4 Pre-Processing Techniques Above**</span>

- *Assess the 'Occupancy' dataset*
- *Provided in the **'datatraining.txt'** file*

In [2]:
import pandas as pd
import numpy as np
import scipy as sp

# file = 'datatraining.txt'
# occupancy_data = pd.read_csv(file)

### <span style="color: blue">Re-scale a *new* dataframe from the Occupancy data</span>
- Reference the dataset and turn it into a dataframe with relevant column identities
- Select the columns valid for Rescaling in a subframe

In [3]:
# Rescale data (between 0 and 1)
from sklearn.preprocessing import MinMaxScaler


- Determine a viable range for re-scaling the data in your new subframe (experiment)
- Use sklearn to Re-scale the subframe based on your determined range
- Display the changes in a summary

In [4]:
# separate or assign subframe components

# display a summary of the Rescaled data


### <span style="color: blue">Standardize a *new* dataframe from the Occupancy data</span>
- Reference the dataset and turn it into a dataframe with relevant column identities
- Select the columns valid for Standardization in a subframe

In [5]:
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler


- Use sklearn to Standardize the subframe based on a mean of 0 and stdev of 1 (experiment)
- Display the changes in a summary

In [6]:
# separate or assign subframe components

# display a summary of the Standardized data


### <span style="color: blue">Normalize a *new* dataframe from the Occupancy data</span>
- Reference the dataset and turn it into a dataframe with relevant column identities
- Select the columns valid for Normalization in a subframe

In [7]:
# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer


- Use sklearn to Normalize the subframe based on a length of 1 (experiment)
- Display the changes in a summary

In [8]:
# separate or assign subframe components

# display a summary of the Normalized data


### <span style="color: blue">Binarize a *new* dataframe from the Occupancy data</span>
- Reference the dataset and turn it into a dataframe with relevant column identities
- Select the columns valid for Binarization in a subframe

In [9]:
# Binarize data (0 at or below the threshold, 1 above it)
from sklearn.preprocessing import Binarizer


- Determine a viable threshold for binarizing the data in your new subframe (experiment)
- Use sklearn to Binarize the subframe based on your determined threshold
- Display the changes in a summary

In [10]:
# separate or assign subframe components

# display a summary of the Binarized data
