# Identifying and Removing Outliers

To identify outliers in the data, we will use the [Tukey Method](http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/).

This means that we will flag any point that lies more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile.
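As a quick illustration of the rule, here is a minimal sketch on a toy vector. The values in `x` are made up for this example and are not taken from the wholesale customers data; the names `lower_fence` and `upper_fence` are likewise only illustrative.

```r
# Toy data: hypothetical values chosen so that one point is clearly extreme
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 50)

# First and third quartiles (R's default quantile type 7)
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Tukey fences: 1.5 * IQR beyond each quartile
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences are treated as outliers
tukey_outliers <- x[x < lower_fence | x > upper_fence]
print(tukey_outliers)
```

Here the quartiles are 4.25 and 8.75, so the fences sit at -2.5 and 15.5, and only the value 50 falls outside them.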

In [1]:
customer_df = read.csv('Wholesale_customers_data.csv')
customer_df$Channel <- NULL
customer_df$Region <- NULL
dim(customer_df)

In [2]:
customer_log_df = log(customer_df)
customer_log_sc_df = data.frame(scale(customer_log_df))

In [3]:
display_outliers <- function (dataframe, feature, param=1.5) {
    # Return the rows whose value for `feature` lies outside the Tukey fences
    feature_vec <- as.vector(dataframe[[feature]])
    Q1 <- quantile(feature_vec, .25)
    Q3 <- quantile(feature_vec, .75)
    # Distance beyond each quartile at which a point counts as an outlier
    tukey_window <- param*(Q3-Q1)
    less_than_Q1 <- feature_vec < Q1 - tukey_window
    greater_than_Q3 <- feature_vec > Q3 + tukey_window
    tukey_mask <- (less_than_Q1 | greater_than_Q3)
    return(dataframe[tukey_mask,])
}

In [4]:
display_outliers(customer_log_sc_df, 'Grocery')

| Row | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|---|---|---|---|---|---|---|
| 76 | 0.8058045 | -1.003267 | -6.578339 | 0.8482054 | -3.304645 | 0.165776 |
| 155 | -1.5523603 | -3.804185 | -3.154701 | -2.3229387 | -2.812321 | -3.498307 |


In [5]:
display_outliers(customer_log_sc_df, 'Milk')

| Row | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|---|---|---|---|---|---|---|
| 87 | 0.8847136 | 2.851919 | 1.734391 | -0.3166314 | 1.813465 | 0.1072521 |
| 99 | -1.695833 | -3.14653 | -1.598717 | -0.3928037 | -1.60406 | -1.359695 |
| 155 | -1.5523603 | -3.804185 | -3.154701 | -2.3229387 | -2.812321 | -3.4983071 |
| 357 | 0.8776329 | -2.980683 | -2.738534 | 0.5885233 | -2.666295 | -0.2737634 |


In [6]:
for (feature in colnames(customer_log_sc_df)){
    outlier_count = nrow(display_outliers(customer_log_sc_df, feature))
    print(paste(feature, outlier_count))
}

[1] "Fresh 16"
[1] "Milk 4"
[1] "Grocery 2"
[1] "Frozen 10"
[1] "Detergents_Paper 2"
[1] "Delicatessen 14"


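Some rows may be flagged as outliers for more than one feature. To find them, we collect the row names of every outlier across all features and count how often each one appears.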

In [7]:
raw_outliers = c()
for (feature in colnames(customer_log_sc_df)){
    outlier_df = display_outliers(customer_log_sc_df, feature)
    outlier_indices = rownames(outlier_df)
    raw_outliers = c(raw_outliers, outlier_indices)
}
raw_outliers

In [8]:
table(raw_outliers)

raw_outliers
110 129 138 143 146 155 162 172 176 184 185 188 194 204 219 234 265 286 290 305 
  1   2   1   1   1   3   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
306 326 339 344 354 356 357 358  39 413 421 430 440  58  66  67  76  82  87  96 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   2   2   1   1   1 
 97  99 
  1   1 
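In the table above, rows 129, 155, 66, 67, and 76 are flagged for more than one feature. One possible way to pull such rows out programmatically, and then drop them, is sketched below. The vector and data frame here are small hypothetical stand-ins (`toy_df` and the names `outlier_counts`, `repeated_outliers`, and `cleaned_df` are not from the original notebook), and removing only the repeated outliers is an assumption about the intended cleanup, not necessarily the approach taken later.

```r
# Hypothetical stand-in for raw_outliers: row names collected across features
raw_outliers <- c("76", "155", "87", "99", "155", "357", "66", "66", "155", "76")

# Count how many features flagged each row
outlier_counts <- table(raw_outliers)

# Keep only the row names that were flagged more than once
repeated_outliers <- names(outlier_counts[outlier_counts > 1])
print(repeated_outliers)

# Hypothetical data frame whose row names match the labels above
toy_df <- data.frame(value = seq_len(6),
                     row.names = c("66", "76", "87", "99", "155", "357"))

# Drop the repeated outliers by row name; drop = FALSE keeps a data frame
cleaned_df <- toy_df[!(rownames(toy_df) %in% repeated_outliers), , drop = FALSE]
print(dim(cleaned_df))
```

Because `rownames()` returns character labels, the filter matches rows by name rather than by position, so it stays correct even after rows have been removed.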

In [9]:
dim(customer_log_sc_df)