# Feature Scaling and Normalization - Lab

## Introduction
In this lab, you'll practice your feature scaling and normalization skills!

## Objectives
You will be able to:
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Compare the different standardization and normalization techniques
* Use standardization/normalization on features of a dataset

## Back to the Ames Housing data

Let's import our Ames Housing data.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
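# The 'seaborn' style was renamed to 'seaborn-v0_8' in newer Matplotlib releases
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')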

ames = pd.read_csv('ames.csv')

## Look at the histograms for the continuous variables

Since there are so many features, it is helpful to filter the columns by datatype and by number of unique values. One heuristic for selecting continuous variables is to keep the features that are not of object datatype and that have at least a certain number of unique values.

In [2]:
# Your code here
# Keep only the numeric columns
process = ames.select_dtypes(['int64', 'float64'])
print(len(process.columns))

# Drop columns with fewer than 6 distinct values
# (dropna=False counts NaN as a value, matching .unique())
process = process.loc[:, process.nunique(dropna=False) >= 6].copy()
process.columns, len(process.columns)

# pd.plotting.scatter_matrix(process)

38


(Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
        'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
        'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
        'LowQualFinSF', 'GrLivArea', 'BedroomAbvGr', 'TotRmsAbvGrd',
        'GarageYrBlt', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
        'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
        'MoSold', 'SalePrice'],
       dtype='object'),
 30)

We can see from the histograms of the continuous features that many of them contain a large share of zeros. For example, WoodDeckSF (square footage of a wood deck) is a positive number giving the size of the deck, and zero if no deck exists. It might have made sense to recode this variable as "deck exists or not" (a binary 1/0 variable). As it stands, it is a zero-inflated variable, which is cumbersome to work with.
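The recoding idea can be sketched on a few made-up values. The toy numbers and the column name `HasWoodDeck` are assumptions for illustration only and are not part of the Ames data:

```python
import pandas as pd

# Toy values standing in for the real column (made up for illustration)
toy = pd.DataFrame({'WoodDeckSF': [0, 192, 0, 40, 0]})

# 1 if the house has a deck (positive square footage), 0 otherwise
toy['HasWoodDeck'] = (toy['WoodDeckSF'] > 0).astype(int)

# Share of zeros in the column, a simple measure of zero inflation
zero_share = (toy['WoodDeckSF'] == 0).mean()

print(toy)
print(zero_share)
```

A column with a high share of zeros can then either be replaced by its indicator or be dropped.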

Let's drop these zero-inflated variables for now and select the features that don't have this characteristic.

In [3]:
# Select non zero-inflated continuous features as ames_cont
# Fraction of zeros in each column
zero_fraction = (process == 0).mean()

# Drop the columns in which more than 35% of the values are zero
process = process.loc[:, zero_fraction <= 0.35].copy()
ames_cont = process.copy()

process.columns, len(process.columns)

(Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
        'OverallCond', 'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtUnfSF',
        'TotalBsmtSF', '1stFlrSF', 'GrLivArea', 'BedroomAbvGr', 'TotRmsAbvGrd',
        'GarageYrBlt', 'GarageArea', 'MoSold', 'SalePrice'],
       dtype='object'),
 19)

## Perform log transformations for the variables where it makes sense

In [7]:
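# Your code here
# pd.plotting.scatter_matrix(process)
# plt.show()
import numpy as np
import scipy.stats as stats

# Log-transform every column for which the Shapiro-Wilk test rejects normality.
# Caveats: columns with missing values do not yield a usable p-value and are left
# unchanged; np.log turns zeros into -inf; and identifier-like columns such as Id
# are transformed too, although a log transform is not meaningful for them.
for column in process.columns:
    st, p = stats.shapiro(process[column])
    if p < 0.005:
        process[column] = np.log(process[column])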
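A more defensive variant is sketched below. It swaps the Shapiro-Wilk test for a simple skewness threshold and only transforms strictly positive columns, so no `-inf` values are created. The helper name `log_transform_skewed`, the threshold of 1.0, and the toy values are all assumptions made for illustration, not part of the lab:

```python
import numpy as np
import pandas as pd

def log_transform_skewed(df, skew_threshold=1.0, exclude=()):
    """Return a copy of df with np.log applied to strictly positive,
    right-skewed columns, plus the list of transformed column names."""
    out = df.copy()
    transformed = []
    for column in out.columns:
        if column in exclude:
            continue
        values = out[column].dropna()
        # np.log is only finite for strictly positive values
        if values.empty or (values <= 0).any():
            continue
        # Sample skewness above the threshold signals a long right tail
        if values.skew() > skew_threshold:
            out[column] = np.log(out[column])
            transformed.append(column)
    return out, transformed

# Toy data (made up): an identifier, a right-skewed column, a column with zeros
toy = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6],
    'LotArea': [8000, 9500, 11000, 7000, 10000, 215000],
    'BsmtFinSF1': [0, 700, 0, 480, 650, 5600],
})
logged, changed = log_transform_skewed(toy, exclude=['Id'])
print(changed)
```

Here only `LotArea` is transformed: `Id` is excluded explicitly and `BsmtFinSF1` is skipped because it contains zeros.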

## Standardize the continuous variables

Store your final features in a DataFrame `features_final`: 

In [10]:
# Your code here
# Standardize every column at once: subtract the column mean and divide by the
# column standard deviation (ddof=0 matches NumPy's default for np.std).
# Note: columns that contained zeros before the log transform now hold -inf,
# so their mean and standard deviation are not finite and the result is NaN.
features_final = (process - process.mean()) / process.std(ddof=0)
features_final.head(), process.head()

(         Id  MSSubClass  LotFrontage   LotArea  OverallQual  OverallCond  \
 0 -6.361248    0.430516    -0.208034 -0.133231     0.684385    -0.440508   
 1 -5.660173   -1.128983     0.409895  0.113442     0.045487     1.884487   
 2 -5.250070    0.430516    -0.084449  0.420061     0.684385    -0.440508   
 3 -4.959098    0.649335    -0.414011  0.103347     0.684385    -0.440508   
 4 -4.733402    0.430516     0.574676  0.878409     1.237824    -0.440508   
 
    YearBuilt  YearRemodAdd  BsmtFinSF1  BsmtUnfSF  TotalBsmtSF  1stFlrSF  \
 0   1.045177      0.877540         NaN        NaN          NaN -0.803570   
 1   0.163448     -0.424183         NaN        NaN          NaN  0.418585   
 2   0.980273      0.829642         NaN        NaN          NaN -0.576560   
 3  -1.873795     -0.715870         NaN        NaN          NaN -0.439287   
 4   0.947796      0.733774         NaN        NaN          NaN  0.112267   
 
    GrLivArea  BedroomAbvGr  TotRmsAbvGrd  GarageYrBlt  GarageArea    Mo
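To compare the scaling techniques named in the objectives, the sketch below applies min-max scaling, standardization, and mean normalization to the same toy series. The helper names and the toy values are assumptions for illustration:

```python
import pandas as pd

def min_max_scale(s):
    # Rescale to the interval [0, 1]
    return (s - s.min()) / (s.max() - s.min())

def standardize(s):
    # Zero mean and unit (population) standard deviation
    return (s - s.mean()) / s.std(ddof=0)

def mean_normalize(s):
    # Zero mean, with values spanning a range of width 1
    return (s - s.mean()) / (s.max() - s.min())

# Toy feature with one large value (made up for illustration)
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0], name='toy_feature')

comparison = pd.DataFrame({
    'original': toy,
    'min_max': min_max_scale(toy),
    'standardized': standardize(toy),
    'mean_normalized': mean_normalize(toy),
})
print(comparison.round(3))
```

All three are linear rescalings, so the shape of the distribution is unchanged. Only the location and spread of the values differ.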

## Summary
Great! You've now got some hands-on practice transforming data using log transforms, feature scaling, and normalization!