# Extra ways to summarize data

## Summarizing vectors

Vectors are usually summarized with measures of central tendency and variability. Let's look at other ways to compute descriptive statistics for vectors.

In [1]:
housing_prices <- read.csv("datasets/kc_house_data.csv")

In [2]:
t(head(housing_prices))

| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| X | 1 | 2 | 3 | 4 | 5 | 6 |
| id | 7129300520 | 6414100192 | 5631500400 | 2487200875 | 1954400510 | 7237550310 |
| date | 20141013T000000 | 20141209T000000 | 20150225T000000 | 20141209T000000 | 20150218T000000 | 20140512T000000 |
| price | 221900 | 538000 | 180000 | 604000 | 510000 | 1225000 |
| bedrooms | 3 | 3 | 2 | 4 | 3 | 4 |
| bathrooms | 1.00 | 2.25 | 1.00 | 3.00 | 2.00 | 4.50 |
| sqft_living | 1180 | 2570 | 770 | 1960 | 1680 | 5420 |
| sqft_lot | 5650 | 7242 | 10000 | 5000 | 8080 | 101930 |
| floors | 1 | 2 | 1 | 1 | 1 | 1 |
| waterfront | 0 | 0 | 0 | 0 | 0 | 0 |


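apply(), lapply(), sapply(), mapply(), tapply() and by() are some of the base R functions we can use to apply a summary function across columns, list elements, or groups. ddply() from the plyr package does a similar job for data frames.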

## APPLY()

The apply() function applies a function over the rows or columns of a matrix (a data frame is first coerced to a matrix), collapsing each row or column to a single result. The second argument, MARGIN, selects the direction: 1 means rows and 2 means columns. We want the mean of every column, so we use 2.

In [4]:
# date is not numeric (it is read as a character or factor column), so mean() cannot be applied to it.
# The id column is just an identifier. Let's exclude both.

less_data <- housing_prices[, !names(housing_prices) %in% c("date", "id")]

In [5]:
apply(less_data, 2, mean)

In [10]:
colMeans(less_data) # much faster than apply(less_data, 2, mean)

# Related helpers: colMeans(), rowMeans(), colSums(), rowSums()
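
To make the MARGIN argument concrete, here is a minimal sketch on a small toy matrix. The matrix `m` and its values are made up for illustration and are not part of the housing data.

```r
# Toy matrix (hypothetical values, not from the housing data)
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
# m is filled column by column:
#    a b c
# r1 1 3 5
# r2 2 4 6

row_means <- apply(m, 1, mean)  # MARGIN = 1: one result per row
col_means <- apply(m, 2, mean)  # MARGIN = 2: one result per column

print(row_means)  # r1 = 3, r2 = 4
print(col_means)  # a = 1.5, b = 3.5, c = 5.5

# The dedicated helpers give the same answers
stopifnot(isTRUE(all.equal(row_means, rowMeans(m))))
stopifnot(isTRUE(all.equal(col_means, colMeans(m))))
```

For plain sums and means the dedicated helpers are the usual choice; apply() is the general tool when the function is something else, such as median or sd.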

Let's create a list using variables "bedrooms" and "bathrooms". 

In [13]:
x <- list(housing_prices$bedrooms, housing_prices$bathrooms)

In [14]:
str(x)

List of 2
 $ : int [1:21613] 3 3 2 4 3 4 3 3 3 3 ...
 $ : num [1:21613] 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...


## LAPPLY()

lapply() applies a function to each element of a list, hence the "l" in lapply. It always returns a list of the same length as its input, with one result per element.

In [18]:
res <- lapply(x, FUN = mean)
res

In [19]:
class(res)
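
The behaviour can be sketched on a small toy list. The list `scores` and its values are hypothetical, chosen only so the results are easy to check by hand.

```r
# Toy list (hypothetical values); elements may differ in length and type
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5L)

means <- lapply(scores, mean)

class(means)    # "list"
length(means)   # 3, same as length(scores)
names(means)    # "a" "b" "c"
means$b         # 15

# Extra arguments after FUN are passed on to it
with_na <- list(a = c(1, NA, 3), b = c(4, 6))
na_means <- lapply(with_na, mean, na.rm = TRUE)
na_means$a      # 2
```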

## SAPPLY()

sapply() also applies a function to each element of a list (or each column of a data frame), but it tries to simplify the result: if every call returns a single value, you get a vector rather than a list. Apart from this simplification of the return value, lapply() and sapply() behave the same.

In [21]:
res <- sapply(less_data, mean)
res
class(res)

In [23]:
res <- sapply(x, mean)
res
class(res)
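
The shape of the value returned by sapply() depends on what the function returns for each element. The sketch below uses a made-up list `scores` (hypothetical values) to show the common cases, and vapply() as a stricter base R alternative.

```r
# Toy list (hypothetical values)
scores <- list(a = c(1, 2, 3), b = c(10, 20, 60))

# One value per element: simplified to a named numeric vector
v <- sapply(scores, mean)    # a = 2, b = 30

# Several values per element: simplified to a matrix, one column per element
r <- sapply(scores, range)   # 2 x 2 matrix, rows are min and max

# Results of differing lengths cannot be simplified, so a list comes back
l <- sapply(list(a = 1:3, b = 1:5), function(x) x[x > 1])

# vapply() states the expected shape up front and fails if FUN deviates
v2 <- vapply(scores, mean, FUN.VALUE = numeric(1))
```

Because the return type of sapply() can change with the data, vapply() is often preferred inside functions and scripts, while sapply() is convenient for interactive work.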

## MAPPLY()

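mapply() is a multivariate version of sapply(): given several vectors or lists, it calls the function with the first elements of each, then the second elements, and so on. With one vector per column, this applies the function across the columns of each row. As in sapply(), the result is simplified to a vector or array where possible.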

For example, the dataset has several variables measuring area: sqft_living, sqft_lot, sqft_above, sqft_basement, sqft_living15 and sqft_lot15. If we want to add up these areas for each house, we can use mapply().

In [24]:
result <- mapply(sum, housing_prices$sqft_living, housing_prices$sqft_lot, housing_prices$sqft_above, housing_prices$sqft_basement, 
                    housing_prices$sqft_living15, housing_prices$sqft_lot15)


In [28]:
head(result, 10)
# mapply() result for the first row = 15000, second row = 21711

In [26]:
class(result)
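
The element-by-element behaviour can be sketched with two short toy vectors. The names `living` and `lot` and their values are hypothetical stand-ins for the real columns.

```r
# Toy vectors (hypothetical values), one entry per "house"
living <- c(1000, 2000, 1500)
lot    <- c(5000, 7000, 6000)

# FUN is called as sum(living[i], lot[i]) for i = 1, 2, 3
totals <- mapply(sum, living, lot)
totals   # 6000 9000 7500

# For plain addition, vectorized arithmetic gives the same result
stopifnot(isTRUE(all.equal(totals, living + lot)))

# mapply() is more useful when the calls return results of different shapes;
# here the lengths differ, so the result stays a list
reps <- mapply(rep, times = 1:3, x = c("a", "b", "c"))
reps[[2]]   # "b" "b"
```

When the inputs are numeric columns of the same data frame and the function is a sum, rowSums() on those columns is usually the simpler and faster option.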

## TAPPLY()

tapply() is used when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor. 

It can be thought of as a grouped version of apply(): the second argument should be a categorical variable (or something that can be coerced to a factor), and the function is applied to the values in each group.

For example, we may want to know the average price of homes for each number of bedrooms.


In [30]:
t(tapply(housing_prices$price, housing_prices$bedrooms, mean))

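| bedrooms | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 33 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean price | 409503.8 | 317642.9 | 401372.7 | 466232.1 | 635419.5 | 786599.8 | 825520.6 | 951184.7 | 1105077 | 893999.8 | 819333.3 | 520000 | 640000 |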
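
The grouping logic is easier to verify by hand on a few toy values. The vectors below are hypothetical and only mirror the names of the real columns.

```r
# Toy data (hypothetical values)
price    <- c(100, 200, 300, 400, 500)
bedrooms <- c(2, 3, 2, 3, 4)

# Mean price within each distinct value of bedrooms
avg <- tapply(price, bedrooms, mean)
avg
# group "2": mean(100, 300) = 200
# group "3": mean(200, 400) = 300
# group "4": mean(500)      = 500

# The result is a 1-d array named by group; index it with the group label
avg["3"]   # 300
```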


## BY()

tapply() summarizes one variable grouped by another. But what if we want to summarize several variables at once? The by() function works like an extended version of tapply(): it splits a data frame by a grouping variable and applies a function to each piece.

In [31]:
byViews <- by(housing_prices[, c("price", "sqft_living")], housing_prices$view, summary)
byViews

housing_prices$view: 0
     price          sqft_living  
 Min.   :  75000   Min.   : 290  
 1st Qu.: 311000   1st Qu.:1390  
 Median : 432500   Median :1850  
 Mean   : 496564   Mean   :1998  
 3rd Qu.: 600000   3rd Qu.:2450  
 Max.   :5570000   Max.   :9200  
------------------------------------------------------------ 
housing_prices$view: 1
     price          sqft_living  
 Min.   : 217000   Min.   : 570  
 1st Qu.: 498750   1st Qu.:1855  
 Median : 690944   Median :2420  
 Mean   : 812281   Mean   :2569  
 3rd Qu.: 921250   3rd Qu.:3180  
 Max.   :3650000   Max.   :6300  
------------------------------------------------------------ 
housing_prices$view: 2
     price          sqft_living   
 Min.   : 169317   Min.   :  470  
 1st Qu.: 485000   1st Qu.: 1842  
 Median : 675000   Median : 2470  
 Mean   : 792401   Mean   : 2655  
 3rd Qu.: 941250   3rd Qu.: 3250  
 Max.   :7062500   Max.   :10040  
------------------------------------------------------------ 
housing_prices$view: 3
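
As a minimal sketch of how by() splits a data frame and applies a function to each piece, the toy example below uses a made-up data frame `homes` (hypothetical values, not the housing data) with colMeans as the per-group function. aggregate() is shown as a base R alternative that returns a plain data frame.

```r
# Toy data frame (hypothetical values)
homes <- data.frame(
  price       = c(100, 200, 300, 400),
  sqft_living = c(1000, 1500, 2000, 2500),
  view        = c(0, 0, 1, 1)
)

# colMeans is applied to the price and sqft_living columns within each view group
by_view <- by(homes[, c("price", "sqft_living")], homes$view, colMeans)

by_view[["0"]]   # price = 150, sqft_living = 1250
by_view[["1"]]   # price = 350, sqft_living = 2250

# aggregate() computes the same group means and returns a regular data frame
agg <- aggregate(cbind(price, sqft_living) ~ view, data = homes, FUN = mean)
agg
```

The object returned by by() prints nicely but is awkward to process further, so aggregate() can be the more practical choice when the group summaries feed into later steps.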
 

## Two-way tables

A two-way table is very informative: it shows the distribution of bathroom counts for every number of bedrooms. With addmargins(), the row and column sums are also displayed, giving the total number of houses with each number of bedrooms or bathrooms.

In [32]:
# Produces a two-way table counting every combination of bedrooms and bathrooms.
# addmargins() appends the row and column sums of these counts.

bed_vs_bath <- table(housing_prices$bedrooms, housing_prices$bathrooms)
addmargins(bed_vs_bath)

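| bedrooms / bathrooms | 0 | 0.5 | 0.75 | 1 | 1.25 | 1.5 | 1.75 | 2 | 2.25 | 2.5 | ... | 5.5 | 5.75 | 6 | 6.25 | 6.5 | 6.75 | 7.5 | 7.75 | 8 | Sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 |
| 1 | 3 | 1 | 27 | 138 | 2 | 12 | 4 | 6 | 4 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 199 |
| 2 | 0 | 2 | 26 | 1558 | 3 | 294 | 304 | 216 | 118 | 197 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2760 |
| 3 | 0 | 0 | 16 | 1780 | 4 | 829 | 1870 | 1048 | 1082 | 2357 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9824 |
| 4 | 0 | 1 | 2 | 325 | 0 | 254 | 719 | 525 | 709 | 2502 | ... | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6882 |
| 5 | 0 | 0 | 0 | 43 | 0 | 48 | 134 | 110 | 116 | 287 | ... | 4 | 2 | 4 | 2 | 1 | 1 | 0 | 0 | 0 | 1601 |
| 6 | 0 | 0 | 0 | 6 | 0 | 6 | 16 | 24 | 15 | 29 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 272 |
| 7 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 3 | 2 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 38 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 13 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 6 |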
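
The same steps can be checked by hand on a few toy observations. The vectors below are hypothetical values, and prop.table() is added as one possible follow-up for turning counts into within-row shares.

```r
# Toy data (hypothetical values), one entry per "house"
bedrooms  <- c(2, 2, 3, 3, 3, 4)
bathrooms <- c(1, 1, 1, 2, 2, 2)

# Count every bedrooms/bathrooms combination
counts <- table(bedrooms, bathrooms)

# Append a "Sum" row and a "Sum" column
with_totals <- addmargins(counts)
with_totals
# row sums: 2, 3, 1; column sums: 3, 3; grand total: 6

# Share of each bathroom count within a given number of bedrooms (rows sum to 1)
row_share <- prop.table(counts, margin = 1)
row_share
```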
