### Deskew and Scale

For numerical features, we can inspect the distributions, measure their skew, and correct it with de-skewing methods.
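As a minimal sketch of what de-skewing can look like, the snippet below measures skew on a synthetic right-skewed vector and then applies a `log1p` transform. The helper `sample_skewness` and the generated data are illustrative assumptions, not part of this notebook; on the real data, `psych::skew` could be applied to the columns of `numeric_df` instead.

```r
# Hypothetical helper: moment-based sample skewness, base R only.
sample_skewness <- function(x) {
    x <- x[!is.na(x)]
    centered <- x - mean(x)
    mean(centered^3) / (mean(centered^2)^1.5)
}

# Synthetic right-skewed data standing in for a feature such as LotArea.
set.seed(42)
skewed <- rlnorm(1000, meanlog = 9, sdlog = 0.5)

# log1p(x) = log(1 + x) is defined at 0, which suits non-negative features.
deskewed <- log1p(skewed)

sample_skewness(skewed)    # clearly positive for log-normal data
sample_skewness(deskewed)  # much closer to 0 after the transform
```

A log transform is only one option; which transform works best depends on the shape of each feature.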

In [25]:
source('../src/load_data.r')
source('../src/multiplot.r')

In [26]:
library(psych)

In [27]:
t(dim(numeric_df))
head(numeric_df)

0,1
1451,23


LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,FirstFlrSF,⋯,GarageYrBlt,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,ThreeSsnPorch,ScreenPorch,PoolArea,MiscVal,YrSold
65,8450,2003,2003,196,706,0,150,856,856,⋯,2003,548,0,61,0,0,0,0,0,2008
80,9600,1976,1976,0,978,0,284,1262,1262,⋯,1976,460,298,0,0,0,0,0,0,2007
68,11250,2001,2002,162,486,0,434,920,920,⋯,2001,608,0,42,0,0,0,0,0,2008
60,9550,1915,1970,0,216,0,540,756,961,⋯,1998,642,0,35,272,0,0,0,0,2006
84,14260,2000,2000,350,655,0,490,1145,1145,⋯,2000,836,192,84,0,0,0,0,0,2008
85,14115,1993,1995,0,732,0,64,796,796,⋯,1993,480,40,30,0,320,0,0,700,2009


In [28]:
t(colnames(numeric_df))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,FirstFlrSF,⋯,GarageYrBlt,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,ThreeSsnPorch,ScreenPorch,PoolArea,MiscVal,YrSold


### Managing 0s: Numeric Features Processing

Here we determine which numeric features contain zeros. For features that are predominantly 0, we might consider turning the numeric feature into a categorical one, either by converting it to a boolean or by binning it.

#### >=2/3 of values are 0s

`mostly_zeros` : If >=2/3 of the values are 0s, we will turn the feature into **booleans**.

#### >1/3 and <2/3 of the values are 0s

`some_zeros` : If more than 1/3 but fewer than 2/3 of the values are 0s, we will **bin** that feature.

The binning method we will use creates one bin for all 0 values, then determines the range of the remaining nonzero values by subtracting their `min` from their `max`. That range is divided into 4 equal-width bins, into which the rest of the values are placed.
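The binning scheme can be traced by hand on a small vector. This is a sketch under assumptions: the values are made up, the feature is taken to be non-negative, and the break layout shown is one reasonable way to realize the description, not necessarily the final one.

```r
# Toy feature with some zeros (illustrative values, not from the data set).
x <- c(0, 0, 10, 20, 50, 90, 0, 30)

nonzero <- x[x != 0]
width <- (max(nonzero) - min(nonzero)) / 4   # (90 - 10) / 4 = 20

# The first interval (-1, 0] catches the zeros; the other four cover the
# nonzero range. .bincode() uses right-closed intervals by default.
breaks <- c(-1, 0, min(nonzero) + width * 1:3, max(nonzero))
breaks               # -1  0 30 50 70 90
.bincode(x, breaks)  # 1 1 2 2 3 5 1 2
```

Because the intervals are right-closed, the first nonzero bin here runs from just above 0 up to `min + width`, so the smallest nonzero value is never mistaken for a zero.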

#### <=1/3 of the values are 0s

`few_zeros` : If <=1/3 of the values are 0s, we will leave the feature as is.
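The three thresholds above can be checked on a toy data frame before applying them to `numeric_df`. The data frame `toy_df` and the labelling vector `zero_group` are hypothetical names used only for this sketch.

```r
# Toy data frame (illustrative values only).
toy_df <- data.frame(
    a = c(0, 0, 0, 0, 5, 0),   # 5/6 zeros
    b = c(0, 3, 0, 7, 0, 2),   # 3/6 zeros
    c = c(1, 2, 0, 4, 5, 6)    # 1/6 zeros
)

# Fraction of zeros per column.
zero_fraction <- colMeans(toy_df == 0)

# Label each column with the group its zero fraction falls into.
zero_group <- ifelse(zero_fraction >= 2/3, "mostly_zeros",
              ifelse(zero_fraction > 1/3, "some_zeros", "few_zeros"))
zero_group
```

Working with fractions rather than raw counts keeps the thresholds independent of the number of rows.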


In [29]:
has_zeros <- function(dataframe){
    column_sums = colSums(dataframe == 0)
    return(t(column_sums))
}

has_zeros(numeric_df)

LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,FirstFlrSF,⋯,GarageYrBlt,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,ThreeSsnPorch,ScreenPorch,PoolArea,MiscVal,YrSold
0,0,0,0,860,464,1284,118,37,0,⋯,0,81,755,653,1244,1427,1335,1444,1399,0


In [41]:
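get_mostly_zeros <- function(dataframe){
    # Columns where at least 2/3 of the rows are 0
    zero_counts <- as.vector(has_zeros(dataframe))
    mostly_zeros_cols <- zero_counts >= (nrow(dataframe) * 2/3)
    return(dataframe[, mostly_zeros_cols, drop = FALSE])
}

get_some_zeros <- function(dataframe){
    # Columns where more than 1/3 but fewer than 2/3 of the rows are 0
    zero_counts <- as.vector(has_zeros(dataframe))
    some_zeros_cols <- zero_counts > (nrow(dataframe) / 3) & zero_counts < (nrow(dataframe) * 2/3)
    return(dataframe[, some_zeros_cols, drop = FALSE])
}

get_few_zeros <- function(dataframe){
    # Columns where at most 1/3 of the rows are 0
    zero_counts <- as.vector(has_zeros(dataframe))
    few_zeros_cols <- zero_counts <= (nrow(dataframe) / 3)
    return(dataframe[, few_zeros_cols, drop = FALSE])
}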

In [44]:
mostly_zeros <- get_mostly_zeros(numeric_df)
some_zeros <- get_some_zeros(numeric_df)
few_zeros <- get_few_zeros(numeric_df)

t(colnames(mostly_zeros))
t(colnames(some_zeros))
t(colnames(few_zeros))

0,1,2,3,4,5,6
BsmtFinSF2,LowQualFinSF,EnclosedPorch,ThreeSsnPorch,ScreenPorch,PoolArea,MiscVal


0,1,2,3
MasVnrArea,SecondFlrSF,WoodDeckSF,OpenPorchSF


0,1,2,3,4,5,6,7,8,9,10,11
LotFrontage,LotArea,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtUnfSF,TotalBsmtSF,FirstFlrSF,GrLivArea,GarageYrBlt,GarageArea,YrSold


In [53]:
turn_numeric_to_bool <- function(feature){
    # TRUE where the value is nonzero, FALSE where it is 0
    bool_feat <- feature != 0
    return(bool_feat)
}


In [None]:
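# Assumes the few_zeros subset defined above is the intended input
describe(few_zeros)

In [None]:
# Assumes the mostly_zeros subset defined above is the intended input
summary(mostly_zeros)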

In [None]:
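turn_numeric_to_bin <- function(feature){
    # Break points: one bin for zeros, then 4 equal-width bins over the nonzero range.
    # Assumes the feature is non-negative (areas, amounts).
    nonzero <- feature[feature != 0]
    maxval <- max(nonzero)
    minval <- min(nonzero)
    quart <- (maxval - minval) / 4
    # .bincode() intervals are right-closed, so (-1, 0] holds only the zeros.
    # Using minval as the second break would put the smallest nonzero value in the zero bin.
    binvals <- c(-1, 0, minval + quart, minval + 2*quart, minval + 3*quart, maxval)
    return(binvals)
}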


In [None]:
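turn_numeric_to_cat <- function(feature){
    binval <- turn_numeric_to_bin(feature)
    bin_labels <- c("0", "bin1", "bin2", "bin3", "bin4")
    # Fix the levels to 1:5 so the labels stay aligned even when a bin is empty
    new_feat <- factor(.bincode(feature, binval), levels = 1:5, labels = bin_labels)
    # Plot the bin counts, then hand back the categorical feature
    barplot(table(new_feat))
    return(invisible(new_feat))
}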

turn_numeric_to_cat(numeric_df$ScreenPorch)