## Sample Analysis

In [1]:
boston <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data')

In [2]:
install.packages("dplyr")
library("dplyr", warn.conflicts = FALSE)


ERROR: Error: package or namespace load failed for ‘dplyr’


In [None]:
install.packages("psych")
library("psych")

In [None]:
library(ggplot2)
library(reshape2)

In [None]:
features <- c('CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV')
boston_stats <- data.frame(features)
stats <- describe(boston)
stats$vars <- NULL; stats$trimmed <- NULL; stats$mad <- NULL;
boston_stats <- cbind(boston_stats, stats)
boston_stats

#### Data subset (n = 5)

In [None]:
# Return n rows of df, drawn at random without replacement
randomSample <- function(df, n) {
    df[sample(nrow(df), n), ]
}

In [None]:
samp <- randomSample(boston,5)


In [None]:
samp_stats <- describe(samp)
samp_stats <- cbind(features, samp_stats)
samp_stats$vars <- NULL ; samp_stats$trimmed <- NULL ; samp_stats$mad <- NULL
samp_stats

#### Z-scores of sample data set

$Z = (X - \mu)/\sigma$

where $X$ is the sample mean, $\mu$ is the population mean, and $\sigma$ is the population standard deviation.
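
As a quick check of the formula, here is a minimal sketch with made-up numbers. The values of `X`, `mu`, and `sigma` below are hypothetical and are not taken from the housing data. Because R arithmetic is vectorized, the same expression handles several features at once.

```r
# Hypothetical values, chosen only to illustrate the formula
X     <- c(6.5, 20.0)   # sample means of two made-up features
mu    <- c(6.0, 25.0)   # population means of the same features
sigma <- c(0.5, 10.0)   # population standard deviations

# Element-wise Z-score: (sample mean - population mean) / population sd
z <- (X - mu) / sigma
print(z)
```

Here the first sample mean sits one standard deviation above its population mean, and the second sits half a standard deviation below.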

In [None]:
z_score <- function (X, mu, sd){
    z <- (X - mu)/sd
    return(z)
}

In [None]:
samp_z_score <- z_score(samp_stats$mean, boston_stats$mean, boston_stats$sd)
samp_z_df <- data.frame(features, samp_z_score)


In [None]:
samp_z_df1 <- subset(samp_z_df, samp_z_score >= 0)
samp_z_df2 <- subset(samp_z_df, samp_z_score < 0)


#### Bar Plot

The bar plot shows the Z-score of each feature's sample mean.
A Z-score close to 0 indicates that the sample mean is close to the population mean.

In [None]:
ggplot() +
    geom_bar(data=samp_z_df1, aes(x=features, y=samp_z_score), stat = "identity") +
    geom_bar(data=samp_z_df2, aes(x=features, y=samp_z_score), stat = "identity") +
    theme_classic()

#### Heat Map

In the heat map, the x-axis shows the features, the y-axis shows the instances (rows of the sample), and the color encodes the value Z.

In [None]:
colnames(samp) <- features
samp
rownames(samp)

In [None]:
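# Reshape to long format (one row per instance/feature cell); z holds the raw value
samp_long <- melt(as.matrix(samp), varnames = c("instance", "feature"), value.name = "z")
ggplot(samp_long, aes(x = feature, y = factor(instance), fill = z)) +
    geom_tile()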

### Comparing Values

Before scaling our data to Z-scores, we cannot compare values across features, because each feature has its own mean and standard deviation. For example, we cannot compare ages with proportions, since they are not on the same scale.

However, once each feature is converted to Z-scores, the features can be compared directly: each standardized feature has a mean of about 0 and a standard deviation of about 1, which places all of them on the same scale.
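
The effect of standardizing can be sketched on a small made-up data frame. The `toy` data and its column names are hypothetical and stand in for two features measured in very different units. Base R's `scale()` is used here, which standardizes each column with that column's own mean and standard deviation.

```r
# Made-up data: two features on very different scales (not the housing data)
toy <- data.frame(
    age        = c(20, 35, 50, 65, 80),          # years
    proportion = c(0.10, 0.20, 0.30, 0.40, 0.50) # fraction between 0 and 1
)

# scale() subtracts each column's mean and divides by its standard deviation
toy_z <- as.data.frame(scale(toy))

col_means <- colMeans(toy_z)    # approximately 0 for every column
col_sds   <- sapply(toy_z, sd)  # 1 for every column

print(col_means)
print(col_sds)
```

After scaling, a value in the `age` column and a value in the `proportion` column both read as "standard deviations from that feature's mean", so they can be compared directly.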