# Top 10 Features

The following applies a linear regression to find the top ten features in the Ames, Iowa housing dataset. Linear regression is not the ideal model for this dataset, but it is used here for learning purposes.

In [1]:
library(caret)
library(ggplot2)
library(dplyr)
library(broom)
set.seed(100)

Loading required package: lattice
Loading required package: ggplot2

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [2]:
### Loading in Data Set - Refer to 01. EDA for more details
ames_URL <- 'https://ww2.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt'
ames <- read.table(url(ames_URL), sep = '\t', header = TRUE, stringsAsFactors = TRUE) # read text columns as factors (no longer the default in R >= 4.0)

ames$Order <- NULL
ames$PID <- NULL

to_be_factors <- c("MS.SubClass", "Overall.Qual", "Overall.Cond", "Bsmt.Full.Bath", "Bsmt.Half.Bath", "Full.Bath", 
                   "Half.Bath", "Bedroom.AbvGr", "Kitchen.AbvGr", "TotRms.AbvGrd", "Fireplaces", "Garage.Cars", "Mo.Sold",
                   "Yr.Sold", "Year.Built", "Year.Remod.Add")
ames[to_be_factors] <- lapply(ames[to_be_factors], factor)

ames$Lot.Frontage[is.na(ames$Lot.Frontage)] <- mean(ames$Lot.Frontage, na.rm=TRUE)
ames$Mas.Vnr.Area[is.na(ames$Mas.Vnr.Area)] <- mean(ames$Mas.Vnr.Area, na.rm=TRUE)
ames$Garage.Yr.Blt[is.na(ames$Garage.Yr.Blt)] <- mean(ames$Garage.Yr.Blt, na.rm=TRUE)

empty_means_without <-c("Alley","Bsmt.Qual","Bsmt.Cond","Bsmt.Exposure","BsmtFin.Type.1", "BsmtFin.Type.2", "Fireplace.Qu",
                        "Garage.Type","Garage.Finish", "Garage.Qual","Garage.Cond","Pool.QC","Fence","Misc.Feature")

replace_empty_with_without <- function(feature) {
    levels(feature) <- c(levels(feature), "Without")
    feature[is.na(feature)] <- "Without"
    return(feature)
}

for (feature in empty_means_without) {
    ames[,feature] <- replace_empty_with_without(ames[,feature])
}

ames <- na.omit(ames)

dummy <- dummyVars(" ~ .", data = ames)
ames <- data.frame(predict(dummy, newdata = ames))

Let's begin by splitting the dummy-encoded data into training (60%) and test sets.

In [3]:
in_train <- createDataPartition(y = ames$SalePrice, p = 0.6, list = FALSE)
train <- ames[in_train,]
test <- ames[-in_train,]

### Linear Regression

To determine the top ten features from the regression, I will rank the predictors by the absolute value of their standardized regression coefficients. Unstandardized coefficients and p-values will not be used, because a large raw coefficient may only reflect the units of the predictor, and a low p-value does not by itself identify a more important predictor.
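
As a small illustration of why standardizing matters, the sketch below uses R's built-in `mtcars` data (not the Ames data, and not part of the original analysis) to show that a standardized coefficient is the raw coefficient rescaled by sd(x) / sd(y). The names `raw_fit`, `std_fit`, and `rescaled` are chosen here for illustration only.

```r
# Illustration on the built-in mtcars data (an assumption; not the Ames data)
vars <- c("mpg", "wt", "hp")

# Fit on the raw data, then on the same data after standardizing every column
raw_fit <- lm(mpg ~ wt + hp, data = mtcars)
std_fit <- lm(mpg ~ wt + hp, data = data.frame(scale(mtcars[, vars])))

# Rescale the raw slopes by sd(x) / sd(y)
rescaled <- coef(raw_fit)[c("wt", "hp")] *
    sapply(mtcars[, c("wt", "hp")], sd) / sd(mtcars$mpg)

# The rescaled raw slopes should match the standardized slopes
agree <- all.equal(unname(rescaled), unname(coef(std_fit)[c("wt", "hp")]))
print(agree)
```

After scaling, every predictor is expressed in standard-deviation units, so the coefficient magnitudes can be compared across predictors. The ranking in this notebook relies on that property.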

In [4]:
train_S <- data.frame(scale(train)) # standardize training set
with_NA <- colnames(train_S)[colSums(is.na(train_S)) > 0] # find all columns with NA's (these were constant, all 0, in the training set, so scaling divided by a zero standard deviation)
train_S[,with_NA] <- NULL # delete them
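
The NA columns arise because `scale()` divides each column by its standard deviation, and a column that is constant in the training split has a standard deviation of zero. A minimal sketch on a made-up data frame (`toy` and its columns are hypothetical) shows the effect and one way to list the affected columns before scaling:

```r
# Hypothetical toy data: column b is constant, like a dummy level absent from the training split
toy <- data.frame(a = c(1, 2, 3), b = c(0, 0, 0))

# After scaling, b is NaN because sd(b) is 0
toy_S <- data.frame(scale(toy))
nan_cols <- colnames(toy_S)[colSums(is.na(toy_S)) > 0]

# The same columns can be identified up front by checking for zero variance
constant_cols <- names(toy)[sapply(toy, function(col) sd(col) == 0)]

print(nan_cols)
print(constant_cols)
```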

In [5]:
fit <- lm(SalePrice ~ ., data = train_S) # Linear Regression; naming the response lets `.` expand to all other columns
lm_coefs <- tidy(fit)
head(lm_coefs)

term,estimate,std.error,statistic,p.value
(Intercept),7.021283e-15,0.005472219,1.283078e-12,1.0
MS.SubClass.20,-0.05066681,0.102820298,-0.4927705,0.6222601
MS.SubClass.30,-0.02044754,0.049363703,-0.4142222,0.6787815
MS.SubClass.40,-0.008863433,0.012623602,-0.7021319,0.4827259
MS.SubClass.45,0.004142154,0.029193386,0.1418867,0.887192
MS.SubClass.50,0.01957658,0.065460561,0.2990591,0.764944


In [9]:
lm_coefs$estimate_absolute <- abs(lm_coefs$estimate) # take the absolute value
head(lm_coefs[order(-lm_coefs$estimate_absolute),],10)

row,term,estimate,std.error,statistic,p.value,estimate_absolute
389,Bedroom.AbvGr.3,-4.085296,0.391008,-10.44811,1.440375e-24,4.085296
388,Bedroom.AbvGr.2,-3.583326,0.3443377,-10.40643,2.157223e-24,3.583326
404,TotRms.AbvGrd.6,3.505965,0.3109634,11.27453,3.655961e-28,3.505965
405,TotRms.AbvGrd.7,3.261693,0.2899471,11.24927,4.744269e-28,3.261693
403,TotRms.AbvGrd.5,3.131193,0.2767705,11.31332,2.447864e-28,3.131193
390,Bedroom.AbvGr.4,-2.751958,0.2619983,-10.50373,8.385981e-25,2.751958
406,TotRms.AbvGrd.8,2.448115,0.217264,11.26792,3.913833e-28,2.448115
402,TotRms.AbvGrd.4,2.058674,0.180523,11.40394,9.549338e-29,2.058674
387,Bedroom.AbvGr.1,-1.637976,0.1579914,-10.3675,3.141888e-24,1.637976
407,TotRms.AbvGrd.9,1.57824,0.1404297,11.23865,5.292522e-28,1.57824


The table above shows the ten dummy-encoded terms with the largest absolute standardized coefficients for Sale Price in the linear regression. They are all levels of the number of bedrooms or the total number of rooms; the number after the variable name is the bedroom or room count. If you are interested in the variables themselves (without each individual category), the top ten features are as follows:

1. Bedroom.AbvGr - Number of bedrooms above grade
2. TotRms.AbvGrd - Total rooms above grade
3. Overall.Qual - Overall quality of material and finish of house
4. X2nd.Flr.SF - Second floor square feet
5. Bsmt.Full.Bath - Number of full bathrooms in basement 
6. Roof.Style - Type of roof
7. X1st.Flr.SF - First floor square feet
8. BsmtFin.SF.1 - Finished basement square feet (type 1)
9. Full.Bath - Full bathrooms above grade
10. Overall.Cond - Overall condition of the house
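
The notebook does not show how the variable-level ranking above was derived from the dummy-level coefficients. One possible approach, sketched here as an assumption rather than the method actually used, is to take the largest absolute standardized coefficient among each variable's dummy columns. `toy_coefs` stands in for `lm_coefs` using four rounded rows from the table above, and `original_vars` is a hypothetical vector of column names as they were before dummy encoding.

```r
# Stand-in for lm_coefs: four rows from the table above, rounded
toy_coefs <- data.frame(
    term = c("Bedroom.AbvGr.3", "Bedroom.AbvGr.2", "TotRms.AbvGrd.6", "TotRms.AbvGrd.7"),
    estimate = c(-4.085, -3.583, 3.506, 3.262)
)

# Hypothetical: column names before dummy encoding
original_vars <- c("Bedroom.AbvGr", "TotRms.AbvGrd")

# For each variable, keep the largest absolute coefficient among its terms.
# A term belongs to a variable if it equals the name (numeric column)
# or starts with the name followed by a dot (dummy level).
max_abs <- sapply(original_vars, function(v) {
    belongs <- toy_coefs$term == v | startsWith(toy_coefs$term, paste0(v, "."))
    max(abs(toy_coefs$estimate[belongs]))
})

ranking <- sort(max_abs, decreasing = TRUE)
print(ranking)
```

Matching on a name prefix can misassign terms when one variable name is a prefix of another, so the result is worth checking by hand.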