## Outlier Analysis

We will use the Tukey method to detect outliers.
That is, we will examine the instances whose values lie more than 1.5\*IQR below the 1st quartile or more than 1.5\*IQR above the 3rd quartile.
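
As a small illustration of the rule, the fences can be computed by hand on a toy vector. The vector `x` below is made up for this example and is not taken from the Boston data; the quartiles use R's default `quantile` method.

```r
# Toy data (hypothetical), with one obviously extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Tukey fences: 1.5 * IQR beyond each quartile
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
tukey_outliers <- x[x < lower_fence | x > upper_fence]
print(tukey_outliers)
```

Here the quartiles are 3.25 and 7.75, so the fences sit at -3.5 and 14.5 and only the value 50 is flagged.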

In [27]:
boston_log_scaled <- read.csv('boston_log_scaled.csv')
boston_log_scaled['X'] <- NULL
dim(boston_log_scaled)

In [28]:
head(boston_log_scaled)

CRIM,INDUS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
-1.9812674,-1.7026586,-0.04899171,0.4576943,0.14188774,0.4086998,-2.1348765,-0.6081371,-1.4425494,0.3022931,-1.2739995
-1.3043481,-0.2629788,-0.73021967,0.2466931,0.45416732,0.7688007,-1.3425573,-1.1163074,-0.2300505,0.3022931,-0.2634499
-1.3046869,-0.2629788,-0.73021967,1.2475534,0.03554534,0.7688007,-1.3425573,-1.1163074,-0.2300505,0.288967,-1.6262487
-1.2257286,-1.7772063,-0.8480142,1.0127782,-0.43638638,1.1380675,-0.8790803,-1.3339352,0.1651151,0.2948776,-2.1510637
-0.8753211,-1.7772063,-0.8480142,1.2003437,-0.1606607,1.1380675,-0.8790803,-1.3339352,0.1651151,0.3022931,-1.1609652
-1.2632149,-1.7772063,-0.8480142,0.2591629,-0.03006705,1.1380675,-0.8790803,-1.3339352,0.1651151,0.2932056,-1.1988612


In [23]:
display_outliers <- function(df, col, param = 1.5){
    # Return the rows of df whose value in col lies outside the Tukey fences
    feature <- as.vector(df[[col]])
    Q1 <- quantile(feature, 0.25)
    Q3 <- quantile(feature, 0.75)
    # Width of the window beyond each quartile (param * IQR)
    tukey_window <- param*(Q3-Q1)
    low_bound <- feature < Q1 - tukey_window
    up_bound <- feature > Q3 + tukey_window
    tukey_mask <- (low_bound | up_bound)
    return(df[tukey_mask,])
}

In [30]:
display_outliers(boston_log_scaled, 'INDUS')

Unnamed: 0,CRIM,INDUS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
57,-1.435886,-3.167744,-1.3975,0.1938494,-0.8443067,1.908681,-1.3425573,-0.4672479,-0.4583125,0.3022931,-1.02896
196,-1.619725,-3.779625,-1.254321,2.0639136,-1.0234577,1.007031,-0.5502381,-0.9842937,-1.9282409,0.2935664,-2.134168


`INDUS` has 2 outliers at indices `57` and `196`.

In [31]:
display_outliers(boston_log_scaled, 'CRIM')

CRIM,INDUS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT


`CRIM` does not have any values that fall outside the Tukey window.

In [32]:
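for (feature in colnames(boston_log_scaled)){
    # Number of rows flagged as outliers for this feature
    outlier_count <- nrow(display_outliers(boston_log_scaled, feature))
    # Values are not auto-printed inside a for loop, so print explicitly
    print(paste(feature, outlier_count))
}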

In [34]:
raw_outliers <- c()
for (feature in colnames(boston_log_scaled)){
    outlier_df <- display_outliers(boston_log_scaled, feature)
    outlier_indices <- rownames(outlier_df)
    raw_outliers <- c(raw_outliers, outlier_indices)
}
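
The loop above gathers flagged row names one feature at a time. As an alternative sketch, the same information can be held in a logical matrix with one column per feature, so that `rowSums` gives the number of flagged features per row. The helper `is_tukey_outlier` and the data frame `toy` are hypothetical names introduced for this example only.

```r
# Hypothetical helper: TRUE where a value lies outside the Tukey fences
is_tukey_outlier <- function(x, param = 1.5) {
    q <- quantile(x, c(0.25, 0.75), names = FALSE)
    window <- param * (q[2] - q[1])
    x < q[1] - window | x > q[2] + window
}

# Toy data frame (made up for illustration)
toy <- data.frame(
    a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50),
    b = c(2, 4, 6, 8, 10, 12, 14, 16, 18, 100),
    c = c(-40, 1, 2, 3, 4, 5, 6, 7, 8, 9)
)

# One logical column per feature, then count the flags in each row
flags <- sapply(toy, is_tukey_outlier)
flags_per_row <- rowSums(flags)

# Row positions flagged in more than one feature
multi_rows <- which(flags_per_row > 1)
print(multi_rows)
```

In this toy data, row 1 is flagged once (column `c`) and row 10 is flagged twice (columns `a` and `b`), so only row 10 is reported.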

In [40]:
outlier_counts <- as.data.frame(table(raw_outliers))
head(outlier_counts)

raw_outliers,Freq
103,1
116,1
119,1
135,1
145,1
146,1


In [50]:
mult_outlier <- outlier_counts[which(outlier_counts$Freq > 1),]

head(mult_outlier)

Unnamed: 0,raw_outliers,Freq
42,254,2
43,258,2
49,263,2
54,268,2
57,284,2
66,368,2


This shows the row indices of the `boston_log_scaled` dataframe that were flagged as outliers in more than one feature.
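
Because `table()` is applied to row names, the `raw_outliers` column comes back as a factor of strings rather than integer positions, and calling `as.integer()` on it directly would return level codes instead of the row labels. Below is a minimal sketch of converting the labels back and dropping those rows. It uses a made-up data frame `toy` and a hand-written vector `toy_outliers` in place of the real objects, and whether such rows should actually be dropped remains a judgment call.

```r
# Toy stand-ins (hypothetical) for the data frame and the collected row names
toy <- data.frame(x = c(10, 20, 30, 40, 50), y = c(1, 2, 3, 4, 5))
toy_outliers <- c("2", "5", "2", "4", "5")  # "2" and "5" appear twice

counts <- as.data.frame(table(toy_outliers))
multi <- counts[counts$Freq > 1, ]

# Factor -> character -> integer recovers the original row labels
multi_idx <- as.integer(as.character(multi$toy_outliers))

# Drop the rows flagged more than once, matching on row names
toy_clean <- toy[!(rownames(toy) %in% multi_idx), ]
print(toy_clean)
```

Matching on `rownames` rather than on position keeps the selection correct even if rows were removed earlier and the labels no longer equal the positions.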