Nicolas
April 25, 2016
data <- read.csv(file = 'PASTAPURCHASE.csv')
1. What is the mean and standard deviation of the quantity of pasta purchased by time unit by household?
mean(data$PASTA)
## [1] 1.841545
sd(data$PASTA)
## [1] 1.025911
The mean is 1.84 (qty/time unit/household) and the standard deviation is 1.02 (qty/time unit/household).
library(dplyr)
by_income <- data %>% group_by(INCOME) %>%
summarize(
area = mean(AREA)
)
by_income[ which.min(by_income$INCOME), ]
## Source: local data frame [1 x 2]
##
## INCOME area
## (dbl) (dbl)
## 1 609.1072 2
by_income[ which.max(by_income$INCOME), ]
## Source: local data frame [1 x 2]
##
## INCOME area
## (dbl) (dbl)
## 1 141067 5
The poorest household is in area #2 and the wealthiest household is in area #5.
3. What is the maximum pasta quantity a household has bought over the whole time period? (Sum the quantity of pasta by household over time and indicate the maximum)
pasta_qty <- data %>% group_by(HHID) %>% summarize( pasta_qty = sum(PASTA) )
max(pasta_qty$pasta_qty)
## [1] 55.36193
The maximum pasta quantity a household has bought over the whole time period is 55.36.
household_area4 <- data %>% filter(AREA == 4) %>%
group_by(HHID) %>% summarize( income = mean(INCOME) )
mean(household_area4$income)
## [1] 29260.13
# OR
mean(data[data$AREA == 4,]$INCOME)
## [1] 29260.13
The average income of households living in area 4 is $29,260.
5. How many households live in area 2, earn more than 20k, and have purchased more than 30 units of pasta over the whole time period?
household_area2 <- data %>% filter(AREA == 2) %>% filter(INCOME > 20000) %>%
group_by(HHID) %>%
summarize( total_qty = sum(PASTA) ) %>% filter(total_qty > 30)
nrow( household_area2 )
## [1] 218
218 households live in area 2, earn more than 20k, and have purchased more than 30 units of pasta over the whole time period.
library(ggplot2)
ggplot(data=data, aes(x=EXPOS,y=PASTA)) +
geom_jitter(shape = 21, colour = "black", fill = "orange") # for shapes that have a border (like 21)
cor.test(data$EXPOS,data$PASTA)
##
## Pearson's product-moment correlation
##
## data: data$EXPOS and data$PASTA
## t = 69.112, df = 39998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3178348 0.3353439
## sample estimates:
## cor
## 0.3266174
The correlation between the purchases of pasta and the exposures is 0.32 (positive correlation, although quite low).
7. Which of the following graphs reports the correct histogram by household of the total purchase of pasta made by the household over the whole period? (Sum the purchases by household and make a histogram.)
by_household <- data %>% group_by(HHID) %>% summarize( total_qty = sum(PASTA) )
ggplot(data=by_household, aes(x=total_qty)) + geom_histogram(color = 'black', fill = "tomato")
8. Which of the following graphs reports the correct time series of the overall total purchase of pasta? (Sum the purchases by time units and plot the quantity by time unit.)
by_time <- data %>% group_by(TIME) %>% summarize( total_qty = sum(PASTA) )
ggplot(data=by_time, aes(x=TIME,y=total_qty)) +
geom_line(colour = "darkred", size = 1) +
geom_smooth()