# <font color="blue">Lesson 2 - Data Retrieval and Preparation</font>

## Binning, Scaling and Normalization

### Binning the Continuous Variable into Bins

Let's say we have a dataset that contains subject ages. Rather than analyze the ages individually, we would like to group them into the following bins:
- Youth: over 18, up to 25
- YoungAdult: over 25, up to 35
- MiddleAge: over 35, up to 60
- Senior: over 60, up to 100

We can use the pandas `cut` function to separate this list into bins. Its main parameters are:

- x = list (or other one-dimensional array) to cut
- bins = either the number of equal-width bins to create, or a list of bin edges
- right = indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1,2,3,4] indicate (1,2], (2,3], (3,4]
- labels = optional list of labels for each bin

Let's first create the data we need: 

In [None]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
bin_names = ['Youth', 'YoungAdult', 'MiddleAge', 'Senior']

Now we can use the cut function to cut our list into labeled bins: 

In [None]:
import pandas as pd
new_cats = pd.cut(ages, bins,labels=bin_names)

# Count how many ages fall into each bin
pd.Series(new_cats).value_counts()
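
To see what the `right` parameter changes in practice, here is a minimal sketch that cuts the same ages twice, once with the default `right=True` and once with `right=False`. The names `right_closed`, `left_closed`, and `comparison` are made up for this illustration. Only the age that sits exactly on a bin edge (25) should change bins.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
bin_names = ['Youth', 'YoungAdult', 'MiddleAge', 'Senior']

# Default right=True: intervals are (18, 25], (25, 35], ... so age 25 lands in 'Youth'
right_closed = pd.cut(ages, bins, labels=bin_names)

# right=False: intervals are [18, 25), [25, 35), ... so age 25 lands in 'YoungAdult'
left_closed = pd.cut(ages, bins, labels=bin_names, right=False)

# Put both results side by side and show the row for the edge value
comparison = pd.DataFrame({
    'age': ages,
    'right_closed': right_closed,
    'left_closed': left_closed,
})
print(comparison[comparison['age'] == 25])
```

If the data contains values that fall exactly on a bin edge, it is worth deciding which side of the edge they belong to before counting.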

### Z-score Normalization

Z-score scaling (standardization) transforms each feature so that it has a mean of zero and a standard deviation of 1. It shifts and rescales the values but does not change the shape of their distribution.

We can use sklearn's `preprocessing.scale` function to scale input data:

In [None]:
# Standardize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing

# load the Iris dataset
iris = load_iris()

In [None]:
# separate the data and target attributes
X = iris.data
y = iris.target

# standardize the data attributes and cast to dataframe
standardized_X = pd.DataFrame(preprocessing.scale(X))
standardized_X.head()
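
As a rough check of what z-score scaling computes, the sketch below applies the formula z = (x - mean) / std column by column and compares the result with `preprocessing.scale`. The small array `X_demo` is invented for this illustration and is not part of the Iris data. The sketch assumes the population standard deviation (NumPy's default, `ddof=0`).

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data: 3 samples (rows), 2 features (columns)
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Manual z-score: subtract each column's mean, divide by each column's std
manual_z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Same operation using sklearn
sklearn_z = preprocessing.scale(X_demo)

# The two results should agree, with column means near 0 and stds near 1
print(np.allclose(manual_z, sklearn_z))
print(sklearn_z.mean(axis=0).round(6))
print(sklearn_z.std(axis=0).round(6))
```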

### MinMax Scaling: Scale to between 0 and 1

In [None]:
min_max_scaler = preprocessing.MinMaxScaler()

# Copy the iris data for min-max scaling
data_minmax = X.copy()

# Scale on the copied data and display the first 5 rows
data_minmax = pd.DataFrame(min_max_scaler.fit_transform(data_minmax))
data_minmax.head()
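
Min-max scaling can also be written out by hand as (x - min) / (max - min) for each feature. The sketch below does this on a small made-up array (`X_demo` is hypothetical, not the Iris data) and compares it with `MinMaxScaler` using its default range of 0 to 1.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data: 3 samples (rows), 2 features (columns)
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Manual min-max scaling, computed per column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual_minmax = (X_demo - col_min) / (col_max - col_min)

# Same operation using sklearn
sklearn_minmax = preprocessing.MinMaxScaler().fit_transform(X_demo)

# Each column should now run from 0 (its minimum) to 1 (its maximum)
print(np.allclose(manual_minmax, sklearn_minmax))
print(sklearn_minmax)
```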

### Normalizing Data
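We can also use sklearn's `normalize` function to rescale each sample (row) to unit norm, rather than rescaling each feature (column) as the scaling methods above do. With the default L2 norm and non-negative data such as Iris, the resulting values fall between 0 and 1: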

In [None]:
# Normalize the data attributes for the Iris dataset.

# Copy the iris data for normalizing
X2 = X.copy()

# normalize the data attributes and display the first 5 rows
normalized_X = pd.DataFrame(preprocessing.normalize(X2))
normalized_X.head()
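
To make the sample-wise behavior concrete, here is a minimal sketch that divides each row by its own length and compares the result with `preprocessing.normalize`. It assumes the default L2 (Euclidean) norm. The array `X_demo` is invented for this illustration.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data: 3 samples (rows), 2 features (columns)
X_demo = np.array([[3.0, 4.0],
                   [1.0, 0.0],
                   [6.0, 8.0]])

# Manual normalization: divide every row by that row's L2 norm
row_norms = np.linalg.norm(X_demo, axis=1, keepdims=True)
manual_norm = X_demo / row_norms

# Same operation using sklearn (norm='l2' is the default)
sklearn_norm = preprocessing.normalize(X_demo)

# Every row should now have length 1; rows 0 and 2 point the same way,
# so they become identical after normalization
print(np.allclose(manual_norm, sklearn_norm))
print(sklearn_norm)
print(np.linalg.norm(sklearn_norm, axis=1))
```

Note that the first and third rows end up identical: normalization keeps the direction of each sample and discards its magnitude.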

## Consider this
How do the values compare between z-scaling, min-max scaling, and normalizing? Use the `describe` method on the created dataframes.