## Outlier Analysis

We use Tukey's method to identify outliers in each feature. We then combine and tally the results to find instances that are outliers for more than one feature.
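For reference, Tukey's rule as it is commonly stated flags a value $x$ when it falls outside a window built from the first and third quartiles. The multiplier $k$ plays the role of the `param` argument in the function defined in this section (default 1.5):

$$
\mathrm{IQR} = Q_3 - Q_1, \qquad x \text{ is flagged if } \; x < Q_1 - k \cdot \mathrm{IQR} \;\; \text{or} \;\; x > Q_3 + k \cdot \mathrm{IQR}, \qquad k = 1.5
$$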

In [1]:
library(MASS)

In [2]:
data(Boston)

In [3]:
dim(Boston)

In [4]:
Boston$medv <- NULL

In [5]:
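# Shift zn and chas away from zero so that log() stays finite
Boston$zn <- Boston$zn + 1
Boston$chas <- Boston$chas + .5
# Log-transform every feature, then center and scale each column
Boston_log <- log(Boston)
Boston_log_scaled <- data.frame(scale(Boston_log))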

In [6]:
display_outliers <- function(dataframe, feature, param = 1.5) {
    feature_vec <- as.vector(dataframe[[feature]])
    # Quartiles and the Tukey window (param times the interquartile range)
    Q1 <- quantile(feature_vec, .25)
    Q3 <- quantile(feature_vec, .75)
    tukey_window <- param * (Q3 - Q1)
    # Flag values below Q1 - window or above Q3 + window
    less_than_Q1 <- feature_vec < Q1 - tukey_window
    greater_than_Q3 <- feature_vec > Q3 + tukey_window
    tukey_mask <- (less_than_Q1 | greater_than_Q3)
    # Return the rows whose value for this feature is flagged
    return(dataframe[tukey_mask, ])
}

In [7]:
for (feature in colnames(Boston_log_scaled)){
    outlier_count = dim(display_outliers(Boston_log_scaled, feature))[1]
    print(paste(feature, outlier_count))
}

[1] "crim 0"
[1] "zn 0"
[1] "indus 2"
[1] "chas 35"
[1] "nox 0"
[1] "rm 27"
[1] "age 17"
[1] "dis 0"
[1] "rad 0"
[1] "tax 0"
[1] "ptratio 16"
[1] "black 78"
[1] "lstat 1"
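One caveat when reading these counts: Tukey's rule behaves oddly on a 0/1 indicator such as `chas`. If fewer than a quarter of the values belong to the minority class, both quartiles coincide, the interquartile range is zero, and every minority value is flagged. This may explain the count for `chas`, which would then reflect the indicator's minority class rather than unusual observations. A minimal sketch with a hypothetical toy vector (not the Boston data):

```r
# Toy 0/1 indicator (hypothetical data): 3 ones among 20 values
indicator <- c(rep(0, 17), rep(1, 3))

q1 <- quantile(indicator, 0.25)
q3 <- quantile(indicator, 0.75)
iqr <- q3 - q1   # both quartiles are 0, so the IQR is 0

# With a zero-width window, every value that differs from the quartiles is flagged
flagged <- indicator < q1 - 1.5 * iqr | indicator > q3 + 1.5 * iqr
sum(flagged)     # number of flagged values
```

The log transform and scaling applied earlier are monotonic, so they do not change which values of a two-valued feature get flagged.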


In [8]:
raw_outliers = c()
for (feature in colnames(Boston_log_scaled)){
    outlier_df = display_outliers(Boston_log_scaled, feature)
    outlier_indices = rownames(outlier_df)
    raw_outliers = c(raw_outliers, outlier_indices)
}
raw_outliers

In [9]:
raw_outliers_df <- data.frame(table(raw_outliers))

In [10]:
single_feature_outliers <- raw_outliers_df[raw_outliers_df$Freq == 1,]

In [11]:
multi_feature_outliers <- raw_outliers_df[raw_outliers_df$Freq > 1,]

In [12]:
multi_feature_outliers$raw_outliers

In [13]:
length(single_feature_outliers$raw_outliers)/dim(Boston)[1]*100

In [14]:
length(multi_feature_outliers$raw_outliers)/dim(Boston)[1]*100

### Outlier Percentages

From the analysis we can conclude the following:

    1) 28.66% of instances are outliers for exactly one feature
    2) 2.96% of instances are outliers for more than one feature (mostly 2 features)
    
The second group is small, while the first covers more than a quarter of the data points, so the two groups call for different treatment.

### Strategy for Handling Outliers

We can drop the points in group **(2)** (2.96%), but we should examine group **(1)** (28.66%) before dropping any of them. Group **(1)** makes up more than a quarter of the data points and could strongly influence the model. One option is to prepare two models, one trained with these outliers and one without, and compare them.
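The two candidate datasets could be built along the following lines. This is a sketch under stated assumptions, not the notebook's actual next step: `toy_df` and `toy_raw_outliers` are hypothetical stand-ins for `Boston_log_scaled` and `raw_outliers`, and the names `df_drop_multi` and `df_drop_all` are illustrative.

```r
# Hypothetical stand-ins for the scaled data and the tallied outlier labels
toy_df <- data.frame(x = c(1, 2, 3, 4, 5), y = c(10, 20, 30, 40, 50))
toy_raw_outliers <- c("2", "4", "4", "5")   # one row name per flagged feature

# Count how many features flagged each row
counts <- table(toy_raw_outliers)
multi_ids <- names(counts[counts > 1])   # flagged for more than one feature
all_ids <- names(counts)                 # flagged for at least one feature

# Candidate 1: drop only the multi-feature outliers
df_drop_multi <- toy_df[!(rownames(toy_df) %in% multi_ids), ]

# Candidate 2: drop every flagged instance
df_drop_all <- toy_df[!(rownames(toy_df) %in% all_ids), ]

nrow(df_drop_multi)
nrow(df_drop_all)
```

Matching on `rownames()` with `%in%` keeps the original row labels intact. When adapting this to the notebook's objects, wrapping the labels in `as.character()` is a cheap safeguard in case they are stored as a factor, since indexing directly with a factor would use its integer codes instead of the labels.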