## Feature Scaling and Normalization - Lab

## Introduction
In this lab, you'll practice your feature scaling and normalization skills!

## Objectives
You will be able to:
* Implement min-max scaling, mean normalization, log normalization, and unit vector normalization in Python
* Identify appropriate normalization and scaling techniques for a given dataset

## Back to our Boston Housing data

Let's import our Boston Housing data. Remember that we categorized two variables ("RAD" and "TAX") and deleted the "NOX" (nitric oxides concentration) variable because it was highly correlated with two other features.

In [1]:
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)

# first, create bins for RAD based on the values observed. 5 edges will result in 4 bins
bins = [0, 3, 4, 5, 24]
bins_rad = pd.cut(boston_features['RAD'], bins)
bins_rad = bins_rad.cat.as_unordered()

# next, create bins for TAX based on the values observed. 6 edges will result in 5 bins
bins = [0, 250, 300, 360, 460, 712]
bins_tax = pd.cut(boston_features['TAX'], bins)
bins_tax = bins_tax.cat.as_unordered()

tax_dummy = pd.get_dummies(bins_tax, prefix="TAX")
rad_dummy = pd.get_dummies(bins_rad, prefix="RAD")
boston_features = boston_features.drop(["RAD","TAX"], axis=1)
boston_features = pd.concat([boston_features, rad_dummy, tax_dummy], axis=1)
boston_features = boston_features.drop("NOX",axis=1)
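
The relationship between bin edges and bins used above can be checked on a small example. This is a minimal sketch on made-up values (not the Boston data); the `values` series is a hypothetical stand-in for the "RAD" column.

```python
import pandas as pd

# Hypothetical values standing in for the RAD column
values = pd.Series([1, 3, 4, 5, 8, 24])
edges = [0, 3, 4, 5, 24]  # 5 edges -> 4 intervals

# pd.cut uses right-closed intervals by default: (0, 3], (3, 4], (4, 5], (5, 24]
binned = pd.cut(values, edges)
dummies = pd.get_dummies(binned, prefix="RAD")

n_bins = len(binned.cat.categories)
print(n_bins)
print(list(dummies.columns))
```

Each row falls into exactly one interval, so each row of `dummies` has exactly one active column.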

## Look at the histograms for the continuous variables

In [2]:
df = boston_features
cnts = df[df.columns.drop(list(df.filter(regex = "RAD")))]

In [3]:
cnts = cnts[cnts.columns.drop(list(df.filter(regex = "TAX")))]
cnts = cnts.drop(["CHAS"], axis = 1)

In [4]:
cnts.hist(figsize = (9,9))

[Figure: 3×3 grid of histograms, one for each of the nine continuous variables in `cnts`]

## Perform log transformations for the variables where it makes sense

Analyze how the log transformations change the distributions: do the transformed variables look closer to normal? What is the problem with the "ZN" variable?

In [5]:
import numpy as np
df_log = cnts.copy()  # work on a copy so the new columns are not written into cnts
df_log["CRIM_log"] = np.log(cnts["CRIM"])

In [6]:
df_log["B_log"] = np.log(cnts["B"])
df_log["AGE_log"] = np.log(cnts["AGE"])

In [7]:
df_log_2 = df_log.drop(["CRIM", "AGE", "B"], axis = 1)
df_log_2.head()

|   | ZN | INDUS | RM | DIS | PTRATIO | LSTAT | CRIM_log | B_log | AGE_log |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 2.31 | 6.575 | 4.09 | 15.3 | 4.98 | -5.064036 | 5.983684 | 4.177459 |
| 1 | 0.0 | 7.07 | 6.421 | 4.9671 | 17.8 | 9.14 | -3.600502 | 5.983684 | 4.368181 |
| 2 | 0.0 | 7.07 | 7.185 | 4.9671 | 17.8 | 4.03 | -3.601235 | 5.973377 | 4.112512 |
| 3 | 0.0 | 2.18 | 6.998 | 6.0622 | 18.7 | 2.94 | -3.430523 | 5.977949 | 3.824284 |
| 4 | 0.0 | 2.18 | 7.147 | 6.0622 | 18.7 | 5.33 | -2.672924 | 5.983684 | 3.992681 |
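
One way to put a number on "closer to normal" is to compare the skewness before and after the log transformation. The sketch below is an illustration on synthetic data, not on the Boston columns: the `skewed` series is a hypothetical right-skewed feature drawn from a log-normal distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed, strictly positive feature (log-normal draws)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000), name="feature")

skew_before = skewed.skew()          # sample skewness of the raw values
skew_after = np.log(skewed).skew()   # sample skewness after the log transform

print(f"skewness before log: {skew_before:.2f}")
print(f"skewness after log:  {skew_after:.2f}")
```

A skewness near zero suggests a more symmetric distribution. The same comparison could be run on columns such as "CRIM" or "AGE", alongside a visual check of the histograms.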


MEG: The problem with the ZN column is that many of its rows contain zero values. The logarithm of zero is undefined (`np.log(0)` returns `-inf`), so a plain log transformation cannot be applied to this column.

"ZN" has a lot of zeros (more than 50%!). Remember that this variable denotes the "proportion of residential land zoned for lots over 25,000 sq. ft.". It might have made sense to categorize this variable as "over 25,000 sq. ft. or not" (a binary 1/0 variable). As it stands, you have a zero-inflated variable, which is cumbersome to work with.
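
The binary recoding suggested above can be sketched as follows. The `zn` series is made-up data standing in for the "ZN" column, and reading "or not" as "greater than zero" is an assumption. `np.log1p` is shown only as a possible alternative that the lab itself does not use.

```python
import numpy as np
import pandas as pd

# Hypothetical zero-inflated column standing in for ZN
zn = pd.Series([18.0, 0.0, 0.0, 0.0, 12.5, 0.0, 75.0], name="ZN")

share_zero = (zn == 0).mean()      # fraction of rows that are exactly zero
zn_binary = (zn > 0).astype(int)   # 1 if any land is zoned for large lots, else 0
zn_log1p = np.log1p(zn)            # log(1 + x): defined at zero and maps 0 to 0

print(f"share of zeros: {share_zero:.2f}")
print(zn_binary.tolist())
```

Note that `np.log1p` avoids the undefined value but does not remove the spike at zero, so the variable stays zero-inflated.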

## Try different types of transformations on the continuous variables

Store your final features in a dataframe `features_final`.
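
As a reminder of the scaling techniques named in the objectives, here is a minimal sketch of each formula written out by hand. The toy series `x` is hypothetical and not taken from the Boston data. Standardization is included for comparison, computed with the sample standard deviation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy feature, not taken from the Boston data
x = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max scaling: values end up in the range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Mean normalization: centered on 0, total range width of 1
mean_norm = (x - x.mean()) / (x.max() - x.min())

# Standardization: mean 0, standard deviation 1 (pandas .std() uses ddof=1)
standardized = (x - x.mean()) / x.std()

# Unit vector normalization: the feature vector has Euclidean length 1
unit_vector = x / np.linalg.norm(x)

print(min_max.tolist())
```

The same expressions apply column by column to a dataframe, since pandas broadcasts `min()`, `max()`, `mean()` and `std()` across columns.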

In [8]:
features_final_a = df_log_2.drop("ZN", axis = 1)
features_final_a.head()

|   | INDUS | RM | DIS | PTRATIO | LSTAT | CRIM_log | B_log | AGE_log |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.31 | 6.575 | 4.09 | 15.3 | 4.98 | -5.064036 | 5.983684 | 4.177459 |
| 1 | 7.07 | 6.421 | 4.9671 | 17.8 | 9.14 | -3.600502 | 5.983684 | 4.368181 |
| 2 | 7.07 | 7.185 | 4.9671 | 17.8 | 4.03 | -3.601235 | 5.973377 | 4.112512 |
| 3 | 2.18 | 6.998 | 6.0622 | 18.7 | 2.94 | -3.430523 | 5.977949 | 3.824284 |
| 4 | 2.18 | 7.147 | 6.0622 | 18.7 | 5.33 | -2.672924 | 5.983684 | 3.992681 |


MEG: I have decided to log the rest of the columns so that all of the data is in one consistent, continuous format. I will then rescale all of the data with min-max scaling, so that I can see the effect it has on the dataframe histograms.

In [9]:
col_to_log = features_final_a.columns[0:5]  # INDUS, RM, DIS, PTRATIO, LSTAT
for col in col_to_log:
    features_final_a[col + "_log"] = np.log(features_final_a[col])
# drop the original (unlogged) columns by name
features_final_a = features_final_a.drop(col_to_log, axis = 1)

In [10]:
from sklearn.preprocessing import MinMaxScaler
x = features_final_a.values  # returns a NumPy array (column labels are not carried over)
min_max_scaler = MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
features_final_b = pd.DataFrame(x_scaled)

In [11]:
features_final_b.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.879193 | 0.393661 | 0.679532 | 0.542096 | 0.348358 | 0.342318 |
| 1 | 0.153211 | 1.0 | 0.933063 | 0.666533 | 0.653269 | 0.623954 | 0.619906 | 0.538917 |
| 2 | 0.153134 | 0.998553 | 0.860849 | 0.666533 | 0.777845 | 0.623954 | 0.619906 | 0.273789 |
| 3 | 0.171005 | 0.999195 | 0.779439 | 0.379532 | 0.748622 | 0.707895 | 0.708405 | 0.171688 |
| 4 | 0.250315 | 1.0 | 0.827003 | 0.379532 | 0.771968 | 0.707895 | 0.708405 | 0.364308 |


MEG: This worked well for rescaling the dataset, but now the dataframe is missing its column headers! So I have to replace those.

In [28]:
features_final_a.columns

Index(['CRIM_log', 'B_log', 'AGE_log', 'INDUS_log', 'RM_log', 'DIS_log',
       'PTRATIO_log', 'LSTAT_log'],
      dtype='object')


MEG: I'm not sure how to identify which column is which now. I assume that the column order is the same before and after scaling, but assuming isn't the best course of action. Ask in the workshop!
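
On the column question: `MinMaxScaler.fit_transform` returns a NumPy array whose columns are in the same order as the input, so the original labels can be reattached directly. The sketch below uses a small made-up dataframe `df` as a stand-in for `features_final_a`, and checks the result against the min-max formula computed by hand.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for features_final_a
df = pd.DataFrame({"CRIM_log": [-5.0, -3.6, -2.7], "B_log": [5.98, 5.97, 5.90]})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)  # NumPy array, columns in the same order as df

# Reattach the original column labels and row index
df_scaled = pd.DataFrame(scaled, columns=df.columns, index=df.index)

# Cross-check against the min-max formula applied column by column
manual = (df - df.min()) / (df.max() - df.min())
print(df_scaled.head())
```

In the lab itself, the equivalent would be passing `columns=features_final_a.columns` when building `features_final_b`.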

## Summary
Great! You've now transformed your final data using feature scaling and normalization, and stored them in the `features_final` dataframe.