---
title: "STATS701 Homework 1"
author: "Jordan Farrer"
date: '2016-09-25'
output:
  html_notebook:
    code_folding: hide
    css: style.css
    number_sections: yes
    theme: flatly
    toc: yes
    toc_float: yes
  html_document:
    code_folding: hide
    css: style.css
    number_sections: yes
    theme: flatly
    toc: yes
    toc_float: yes
  pdf_document:
    number_sections: yes
    toc: yes
---
# Setup
Full repo: [https://github.com/jrfarrer/stats701_hw1/](https://github.com/jrfarrer/stats701_hw1/)
Published file: [https://jrfarrer.github.io/stats701_hw1/](https://jrfarrer.github.io/stats701_hw1/)
Begin by setting up the R session, creating a logger function, and loading packages.
```{r setup, include=FALSE}
# Set options for the rmarkdown file
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, fig.align = 'center', width = 100, message = FALSE)
```
```{r setup2}
# Set the seed for reproducibility
set.seed(44)
# Set the locale of the session so languages other than English can be used
invisible(Sys.setlocale("LC_ALL", "en_US.UTF-8"))
# Prevent printing in scientific notation
options(digits = 4, width = 220)
# Create a logger function
logger <- function(msg, level = "info") {
  cat(paste0("[", format(Sys.time(), "%Y-%m-%d %H:%M:%OS3"), "][", level, "] ", msg, "\n"), file = stdout())
}
# Set the project directory
base_dir <- ''
data_dir <- paste0(base_dir, "data/")
code_dir <- paste0(base_dir, "code/")
viz_dir <- paste0(base_dir, "viz/")
dir.create(data_dir, showWarnings = FALSE)
dir.create(code_dir, showWarnings = FALSE)
dir.create(viz_dir, showWarnings = FALSE)
```
```{r Load Packages}
# Create a function that will be used to load/install packages
fn_load_packages <- function(p) {
  if (!is.element(p, installed.packages()[, 1]) || (p == "DT" && !(packageVersion(p) > "0.1"))) {
    if (p == "DT") {
      devtools::install_github('rstudio/DT')
    } else {
      install.packages(p, dep = TRUE, repos = 'http://cran.us.r-project.org')
    }
  }
  a <- suppressPackageStartupMessages(require(p, character.only = TRUE))
  if (a) {
    logger(paste0("Loaded package ", p, " version ", packageVersion(p)))
  } else {
    logger(paste0("Unable to load package ", p))
  }
}
# Create a vector of packages
packages <- c('tidyverse', 'ggthemes', 'knitr', 'readxl', 'broom', 'forecast', 'stringr',
              'ISLR', 'GGally', 'gridExtra', 'leaps', 'extrafont', 'pander')
# Use function to load the required packages
invisible(lapply(packages, fn_load_packages))
```
```{r}
# To install the second font, run the following two lines of code and add the name of the user to the vector below
# system(paste0("cp -r ",viz_dir,"fonts/. ~/Library/Fonts/")) # instantaneous
# font_import() # takes approximately 5-10 min
users_v <- c("Jordan")
```
```{r Create palette and theme}
# Create a color palette
pal538 <- ggthemes_data$fivethirtyeight
# Create a theme to use throughout the analysis
theme_jrf <- function(base_size = 8, base_family = ifelse(Sys.info()[['user']] %in% users_v, "DecimaMonoPro", "Helvetica")) {
  theme(
    plot.background = element_rect(fill = "#F0F0F0", colour = "#606063"),
    panel.background = element_rect(fill = "#F0F0F0", colour = NA),
    panel.border = element_blank(),
    panel.grid.major = element_line(colour = "#D7D7D8"),
    panel.grid.minor = element_line(colour = "#D7D7D8", size = 0.25),
    panel.margin = unit(0.25, "lines"),
    panel.margin.x = NULL,
    panel.margin.y = NULL,
    axis.ticks.x = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title = element_text(colour = "#A0A0A3"),
    axis.text.x = element_text(vjust = 1, family = 'Helvetica', colour = '#3C3C3C'),
    axis.text.y = element_text(hjust = 1, family = 'Helvetica', colour = '#3C3C3C'),
    legend.background = element_blank(),
    legend.key = element_blank(),
    plot.title = element_text(face = 'bold', colour = '#3C3C3C', hjust = 0),
    text = element_text(size = 9, family = ifelse(Sys.info()[['user']] %in% users_v, "DecimaMonoPro", "Helvetica")),
    title = element_text(family = ifelse(Sys.info()[['user']] %in% users_v, "DecimaMonoPro", "Helvetica"))
  )
}
```
# Question 2
## Data Loading
Let's use Hadley Wickham's readr package to load the dataset, using the `col_names` argument to set the column names of the tibble.
```{r Load and Clean data, message = FALSE, warning = FALSE}
# Load the csv with meaningful column names
survey_results <- read_csv(paste0(data_dir, 'Survey_results_final.csv'), skip = 1,
                           col_names = c('hitid', 'hittypeid', 'title', 'description', 'keywords',
                                         'reward', 'creationtime', 'maxassignments', 'requesterannotation',
                                         'assignmentdurationinseconds', 'autoapprovaldelayinseconds',
                                         'expiration', 'numberofsimilarhits', 'lifetimeinseconds',
                                         'assignmentid', 'workerid', 'assignmentstatus', 'accepttime',
                                         'submittime', 'autoapprovaltime', 'approvaltime', 'rejectiontime',
                                         'requesterfeedback', 'worktime', 'lifetimeapprovalrate',
                                         'last30daysapprovalrate', 'last7daysapprovalrate', 'age',
                                         'education', 'gender', 'income', 'sirius', 'wharton', 'approve', 'reject'))
# Print a few records in the tibble
survey_results %>%
select(age, education, gender, income, sirius, wharton, worktime)
# Put into a new tibble we'll use for cleaning (there will be a final later)
survey_results_cleaning <- survey_results
```
## Data Cleaning
We'll sequentially clean each of the primary variables of the dataset and create exploratory summaries.
### Age
Let's quickly summarize the age variable, noting that it is a character.
```{r Data cleaning - Age1}
survey_results_cleaning %>% group_by(age) %>% summarise(cnt = n()) %>% arrange(age)
```
We correct some errant values, using our judgement as **data analysts** and plot a histogram.
```{r Data cleaning - Age2}
survey_results_cleaning <-
  survey_results %>%
  mutate(
    age2 = ifelse(age == 'Eighteen (18)', "18", ifelse(age == 'female', NA, ifelse(age == "27`", "27", age)))
    , age2 = as.integer(age2)
  )
ggplot(survey_results_cleaning, aes(x = age2)) +
geom_point(aes(x = 4, y = 1), shape = 1, colour = pal538['red'], fill = NA, size = 6, stroke = 1) +
geom_point(aes(x = 223, y = 1), shape = 1, colour = pal538['red'], fill = NA, size = 6, stroke = 1) +
geom_histogram(binwidth = 1, fill = pal538['blue']) +
theme_jrf() +
scale_x_continuous(expand = c(0.05, 0.01)) + scale_y_continuous(expand = c(0.02, 0.01)) +
labs(title = "Age", y = "Count", x = "Age (years)")
```
It looks like we still missed some bad values.
```{r Data cleaning - Age3}
sort(unique(survey_results_cleaning$age2))
```
We fix those too and plot the histogram.
```{r Data cleaning - Age4}
survey_results_cleaning <-
  survey_results_cleaning %>%
  mutate(
    age3 = ifelse(age2 %in% c(4, 223), NA, age2)
  )
ggplot(survey_results_cleaning, aes(x = age3)) + geom_histogram(binwidth = 1, fill = pal538['blue']) +
theme_jrf() +
scale_x_continuous(expand = c(0.05, 0.01)) + scale_y_continuous(expand = c(0.02, 0.01)) +
labs(title = "Age", y = "Count", x = "Age (years)")
```
### Education
Let's look at the unique values and counts.
```{r Data cleaning - Education1}
survey_results_cleaning %>% group_by(education) %>% summarise(cnt = n()) %>% arrange(education)
```
It appears that `r nrow(filter(survey_results_cleaning, education == "select one"))` respondents left the survey on the default which read 'select one'. We'll update that to 'Other' and modify this variable to be a factor.
```{r Data cleaning - Education2}
survey_results_cleaning <-
  survey_results_cleaning %>%
  mutate(
    education2 = ifelse(education == "select one", "Other", education)
    , education2 = factor(education2, levels = c('Less than 12 years; no high school diploma'
                                                 , 'High school graduate (or equivalent)'
                                                 , 'Some college, no diploma; or Associate’s degree'
                                                 , 'Bachelor’s degree or other 4-year degree'
                                                 , 'Graduate or professional degree'
                                                 , 'Other'))
  )
survey_results_cleaning %>% group_by(education2) %>% summarise(cnt = n()) %>% arrange(education2)
```
### Gender
We'll summarize the gender variable.
```{r Data cleaning - Gender1}
survey_results_cleaning %>% group_by(gender) %>% summarise(cnt = n()) %>% arrange(gender)
```
We update this to be a factor.
```{r Data cleaning - Gender2}
survey_results_cleaning <-
  survey_results_cleaning %>%
  mutate(gender2 = as.factor(gender))
survey_results_cleaning %>%
  group_by(gender2) %>%
  summarise(cnt = n()) %>%
  arrange(gender2) %>%
  mutate(prop = cnt / sum(cnt))
```
### Income
```{r Data clean - Income1}
survey_results_cleaning %>% group_by(income) %>% summarise(cnt = n()) %>% arrange(income)
```
Let's convert this to a factor variable.
```{r Data clean - Income2}
survey_results_cleaning <-
  survey_results_cleaning %>%
  mutate(
    income2 = factor(income, levels = c('Less than $15,000'
                                        , '$15,000 - $30,000'
                                        , '$30,000 - $50,000'
                                        , '$50,000 - $75,000'
                                        , '$75,000 - $150,000'
                                        , 'Above $150,000'))
  )
survey_results_cleaning %>% group_by(income2) %>% summarise(cnt = n()) %>% arrange(income2)
```
### Sirius and Wharton
```{r Data clean - sirius_wharton1}
survey_results_cleaning %>% group_by(sirius) %>% summarise(cnt = n()) %>% arrange(sirius)
survey_results_cleaning %>% group_by(wharton) %>% summarise(cnt = n()) %>% arrange(wharton)
```
Let's convert these to factors for better analysis capabilities.
```{r Data clean - sirius_wharton2}
survey_results_cleaning <-
  survey_results_cleaning %>%
  mutate(
    sirius2 = factor(sirius, levels = c("Yes", "No"))
    , wharton2 = factor(wharton, levels = c("Yes", "No"))
  )
survey_results_cleaning %>% group_by(sirius2) %>% summarise(cnt = n()) %>% arrange(sirius2)
survey_results_cleaning %>% group_by(wharton2) %>% summarise(cnt = n()) %>% arrange(wharton2)
```
### Worktime
```{r Data clean - Worktime1, fig.align = 'center'}
ggplot(survey_results_cleaning, aes(x = worktime)) + geom_histogram(binwidth = 1, fill = pal538['blue']) +
theme_jrf() +
scale_x_continuous(expand = c(0.05, 0.01)) + scale_y_continuous(expand = c(0.02, 0.01)) +
labs(title = "Worktime", y = "Count", x = "Worktime in Seconds")
```
### Final Data Frame
We select and rename the columns.
```{r Data Clean - Final}
survey_results_cleaning <-
  survey_results_cleaning %>%
  select(age3, education2, gender2, income2, sirius2, wharton2, worktime) %>%
  rename(
    age = age3
    , education = education2
    , gender = gender2
    , income = income2
    , sirius = sirius2
    , wharton = wharton2
  )
```
Let's review the records with missing data.
```{r}
survey_results_cleaning[!complete.cases(survey_results_cleaning), ] %>% print(width = Inf)
```
We will remove the `r nrow(filter(survey_results_cleaning, is.na(sirius) | is.na(wharton)))` records that have NAs for sirius or wharton. Without information about the response, we will have trouble making an estimate of *p*, the proportion of Sirius listeners who listened to Business Radio Powered by the Wharton School.
```{r}
survey_results_cleaning %>%
  filter(is.na(sirius) | is.na(wharton))
```
We put together a final data frame.
```{r}
survey_results_final <-
  survey_results_cleaning %>%
  filter(!is.na(sirius) & !is.na(wharton))
```
## Summary
We previously listened to the Planet Money podcast's [episode](http://www.npr.org/sections/money/2015/01/30/382657657/episode-600-the-people-inside-your-machine) about Amazon's Mechanical Turk program.
```{r Summary - Summary Stats, results = 'asis'}
pander(summary(survey_results_final), missing = "", split.table = Inf)
```
|Variable|Class|Description|
|---|-----|-----|
|Age|Integer|The age in years of the survey respondent|
|Education|Factor|Level of education attained by the survey respondent|
|Gender|Factor|Gender indicated by the survey respondent (Male or Female)|
|Income|Factor|Income level provided by the survey respondent|
|Sirius|Factor|Response to "Have you ever listened to Sirius Radio?"|
|Wharton|Factor|Response to "Have you ever listened to Sirius Business Radio by Wharton?"|
|Worktime|Integer|Number of seconds spent completing the survey|
## Sample properties
### (1)
On the surface, we have no reason to believe that the MTURK dataset is representative of the US population. Knowledge of MTURK is not universal, and the platform attracts particular types of individuals who are willing to perform many small tasks for a minor reward (as discussed in the Planet Money podcast).
First, consider the proportion of Sirius listeners. If the US population is [321.4 million](https://www.census.gov/quickfacts/) and Sirius has 51.6 million listeners, then the proportion of Sirius listeners is
$$\frac{51.6}{321.4} = `r round(51.6/321.4, 4)`$$
In the survey data from MTURK, however, the proportion of Sirius listeners is much higher.
```{r}
(sirius_prop <- survey_results_final %>% summarise(prop_sirius = sum(sirius == "Yes") / n()))
```
We see that of the survey respondents, `r round(100 * sirius_prop, 2)`% say they have listened to Sirius.
Second, to answer the question "Does this appear to be a random sample from the US population?" empirically, we can look at the four characteristics in our final dataset:
1. Age
2. Gender
3. Education
4. Income
For age and gender, we download a table called "Population by Age" from US Census Bureau's [Current Population Survey](http://www.census.gov/population/age/data/files/2012/2012gender_table1.csv) in 2012.
```{r}
census_age_gender <- read_csv(url("http://www.census.gov/population/age/data/files/2012/2012gender_table1.csv"),
                              skip = 6,
                              col_names = c("age", "all", "all_percent", "male", "male_percent", "female", "female_percent"),
                              col_types = cols(age = col_character(),
                                               all = col_number(),
                                               all_percent = col_number(),
                                               male = col_number(),
                                               male_percent = col_number(),
                                               female = col_number(),
                                               female_percent = col_number())
)
census_age_gender
```
We will need to bucket our MTURK dataset to match the Census Bureau's categories. In doing so, we remove the `r nrow(filter(survey_results_final, age < 20 | is.na(gender)))` records with an age of 18-19 or without a listed gender.
```{r}
actual <-
  survey_results_final %>%
  filter(age >= 20 & !is.na(gender)) %>%
  mutate(age_bucket = paste0(floor(age / 10), "0 to ", floor(age / 10), "9 years")) %>%
  group_by(age_bucket, gender) %>%
  summarise(
    n = n()
  ) %>%
  ungroup() %>%
  mutate(source = "Actual") %>%
  select(source, age_bucket, gender, n)
actual_size <- sum(actual$n)
actual
```
Next we clean the Census Bureau's dataset and scale the expected number of individuals to our dataset size of `r actual_size`.
```{r}
expected <-
  census_age_gender %>%
  filter(row_number() <= 19) %>%
  select(age, male, female) %>%
  mutate(age = gsub("\\.", "", age)) %>%
  filter(!(age %in% c('Under 5 years', 'All ages', '5 to 9 years', '10 to 14 years', '15 to 19 years'))) %>%
  mutate(age_bucket = paste0(substring(age, 1, 1), "0 to ", substring(age, 1, 1), "9 years")) %>%
  mutate(age_bucket = ifelse(age_bucket == "80 to 89 years", "80 years plus", age_bucket)) %>%
  select(-age) %>%
  gather(gender, n, -age_bucket) %>%
  group_by(age_bucket, gender) %>%
  summarise(n = sum(n)) %>%
  ungroup() %>%
  mutate(gender = paste0(toupper(substring(gender, 1, 1)), substring(gender, 2, 999))) %>%
  mutate(percent = n / sum(n)) %>%
  mutate(Expected = actual_size * percent) %>%
  select(age_bucket, gender, Expected) %>%
  gather(source, n, -age_bucket, -gender) %>%
  select(source, age_bucket, gender, n)
expected
```
Then we can combine the two.
```{r}
actual_expected <-
  union(actual, expected) %>%
  mutate(
    source = factor(source, levels = c("Actual", "Expected"))
    , gender = factor(gender, levels = c("Male", "Female"))
  )
```
We find that the MTURK sample is younger and more male than the US population. For example, in a sample of `r actual_size` we would expect to find `r actual_expected %>% filter(source == "Expected" & age_bucket == "20 to 29 years" & gender == "Male") %>% select(n) %>% unlist() %>% round()` males, 20 to 29 years old. However, in the MTURK sample there are `r actual_expected %>% filter(source == "Actual" & age_bucket == "20 to 29 years" & gender == "Male") %>% select(n) %>% unlist()` males, 20 to 29 years old, or `r round(actual_expected %>% filter(source == "Actual" & age_bucket == "20 to 29 years" & gender == "Male") %>% select(n) %>% unlist() - actual_expected %>% filter(source == "Expected" & age_bucket == "20 to 29 years" & gender == "Male") %>% select(n) %>% unlist())` more than expected. In addition, in the US population we would expect `r actual_expected %>% group_by(source, gender) %>% summarise(sum = sum(n)) %>% mutate(p = round(100 * sum / sum(sum), 1)) %>% filter(source == "Expected" & gender == "Female") %>% .$p`% females and `r actual_expected %>% group_by(source, gender) %>% summarise(sum = sum(n)) %>% mutate(p = round(100 * sum / sum(sum), 1)) %>% filter(source == "Expected" & gender == "Male") %>% .$p`% males. However, the MTURK sample has `r actual_expected %>% group_by(source, gender) %>% summarise(sum = sum(n)) %>% mutate(p = round(100 * sum / sum(sum), 1)) %>% filter(source == "Actual" & gender == "Female") %>% .$p`% females and `r actual_expected %>% group_by(source, gender) %>% summarise(sum = sum(n)) %>% mutate(p = round(100 * sum / sum(sum), 1)) %>% filter(source == "Actual" & gender == "Male") %>% .$p`% males.
```{r fig.height=7, fig.width=8}
actual_expected %>%
  ggplot(aes(x = source, y = n, fill = source)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  facet_grid(age_bucket ~ gender, switch = "y") +
  theme_jrf() +
  labs(title = "MTURK is younger and more male than US Population",
       y = paste0("Respondents to Survey (", actual_size, ")"), x = NULL) +
  scale_fill_manual(values = c(Actual = pal538['blue'][[1]], Expected = pal538['red'][[1]])) +
  guides(fill = FALSE) +
  geom_text(aes(label = round(n, 0)), hjust = 0, family = "DecimaMonoPro") +
  scale_y_continuous(expand = c(0.02, 40)) +
  theme(strip.text.y = element_text(size = 6))
```
Looking at education, we download data from the US Census Bureau's [Current Population Report](https://www.census.gov/hhes/socdemo/education/data/cps/2015/Table%201-01.csv), which shows statistics on educational attainment. The data is broken out by age and gender, but we aggregate across ages to the total population to compare with the MTURK sample. The table below shows expected vs actual proportions.
```{r message = FALSE, warning = FALSE, error = FALSE}
census_edu <- read_csv(url("https://www.census.gov/hhes/socdemo/education/data/cps/2015/Table%201-01.csv"),
                       skip = 5)
edu_expected <-
  census_edu %>%
  select(1, 3:17) %>%
  filter(row_number() %in% c(2:15)) %>%
  select(-X1) %>%
  mutate(`Doctoral degree` = as.integer(gsub(",", "", `Doctoral degree`))) %>%
  summarise_each(funs(sum(., na.rm = TRUE))) %>%
  gather(education, n) %>%
  mutate(
    education = ifelse(education == "None", "Other",
                ifelse(education %in% c("1st - 4th grade", "5th - 6th grade", "7th - 8th grade", "9th grade",
                                        "10th grade", "11th grade /2"), "Less than 12 years; no high school diploma",
                ifelse(education == "High school graduate", "High school graduate (or equivalent)",
                ifelse(education %in% c("Some college, no degree", "Associate's degree, occupational",
                                        "Associate's degree, academic"),
                       "Some college, no diploma; or Associate’s degree",
                ifelse(education == "Bachelor's degree", "Bachelor’s degree or other 4-year degree",
                ifelse(education %in% c("Master's degree", "Professional degree", "Doctoral degree"),
                       "Graduate or professional degree", NA))))))
    , education = factor(education, levels = c('Less than 12 years; no high school diploma'
                                               , 'High school graduate (or equivalent)'
                                               , 'Some college, no diploma; or Associate’s degree'
                                               , 'Bachelor’s degree or other 4-year degree'
                                               , 'Graduate or professional degree'
                                               , 'Other'))
  ) %>%
  group_by(education) %>%
  summarise(expected_n = sum(n)) %>%
  ungroup() %>%
  mutate(expected = expected_n / sum(expected_n))
edu_actual <-
  survey_results_final %>%
  group_by(education) %>%
  summarise(actual_n = n()) %>%
  ungroup() %>%
  mutate(actual = actual_n / sum(actual_n))
comparison_tbl_edu <-
  inner_join(edu_expected, edu_actual, by = c("education")) %>%
  mutate(
    expected = paste0(round(100 * expected, 1), "%")
    , actual = paste0(round(100 * actual, 1), "%")
  ) %>%
  select(-actual_n, -expected_n)
comparison_tbl_edu
```
We find that the MTURK sample over-indexes on people who have attended or graduated from college. Notably, in a sample of the US population we would expect to find `r comparison_tbl_edu %>% filter(education == "Some college, no diploma; or Associate’s degree") %>% select(expected) %>% unlist()` of people with 'Some college, no diploma; or Associate’s degree', but in the MTURK sample `r comparison_tbl_edu %>% filter(education == "Some college, no diploma; or Associate’s degree") %>% select(actual) %>% unlist()` fit this category of educational attainment.
We use a proportions test to determine if the proportions are indeed different.
```{r}
edu_matrix <-
  inner_join(edu_expected, edu_actual, by = c("education")) %>%
  select(expected_n, actual_n) %>%
  as.matrix()
(edu_prop_test <- prop.test(edu_matrix))
```
Using `r edu_prop_test$method`, we have strong evidence against the null hypothesis that the proportions in the education groups are the same. This provides further evidence that the MTURK sample does not represent the US population.
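For intuition on what `prop.test` does with a count matrix: it treats the first column as "successes" and the second as "failures", and with more than two rows it reduces to a chi-squared test of homogeneity with no continuity correction. A minimal sketch with hypothetical counts (the numbers below are made up for illustration):

```{r}
# Hypothetical expected/actual counts for three groups (illustration only)
counts <- matrix(c(50, 60, 40, 30, 55, 65), ncol = 2,
                 dimnames = list(NULL, c("expected_n", "actual_n")))

# prop.test tests whether the "success" proportion is the same in every row
pt <- prop.test(counts)

# For k > 2 rows this matches the ordinary chi-squared test of homogeneity
ct <- chisq.test(counts)
c(prop_test_stat = unname(pt$statistic), chisq_stat = unname(ct$statistic))
```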
Looking at income, we download from the US Census Bureau statistics on [personal income](http://www.census.gov/data/tables/time-series/demo/income-poverty/cps-pinc/pinc-01.html).
```{r}
download.file("http://www2.census.gov/programs-surveys/cps/tables/pinc-01/2016/pinc01_1_1_1.xls",
              destfile = paste0(data_dir, 'pinc01_1_1_1.xls'), mode = "wb")
income <- read_excel(paste0(data_dir, 'pinc01_1_1_1.xls'), skip = 8)
income_expected <-
  income[, c(4:44)] %>%
  filter(row_number() == 2) %>%
  gather(income, n) %>%
  mutate(
    n = as.integer(n)
  ) %>%
  select(income, n) %>%
  mutate(
    income = ifelse(row_number() <= 6, "Less than $15,000",
             ifelse(row_number() <= 12, "$15,000 - $30,000",
             ifelse(row_number() <= 20, "$30,000 - $50,000",
             ifelse(row_number() <= 30, "$50,000 - $75,000",
                    "Above $75,000"))))
    , income = factor(income, levels = c("Less than $15,000", "$15,000 - $30,000", "$30,000 - $50,000",
                                         "$50,000 - $75,000", "Above $75,000"))
  ) %>%
  group_by(income) %>%
  summarise(
    n = sum(n)
  ) %>%
  ungroup() %>%
  mutate(expected = n / sum(n)) %>%
  mutate(expected_n = n) %>%
  select(income, expected_n, expected)
income_actual <-
  survey_results_final %>%
  filter(!is.na(income)) %>%
  mutate(
    income = as.character(income)
    , income = ifelse(income %in% c("$75,000 - $150,000", "Above $150,000"), "Above $75,000", income)
    , income = factor(income, levels = c("Less than $15,000", "$15,000 - $30,000", "$30,000 - $50,000",
                                         "$50,000 - $75,000", "Above $75,000"))
  ) %>%
  group_by(income) %>%
  summarise(
    n = n()
  ) %>%
  ungroup() %>%
  mutate(actual = n / sum(n)) %>%
  mutate(actual_n = n) %>%
  select(income, actual_n, actual)
comparison_tbl_income <-
  inner_join(income_expected, income_actual, by = c("income")) %>%
  mutate(
    expected = paste0(round(100 * expected, 1), "%")
    , actual = paste0(round(100 * actual, 1), "%")
  ) %>%
  select(-actual_n, -expected_n)
comparison_tbl_income
```
Looking at the table above, we see there is a smaller percentage of lower-income respondents than expected (`r comparison_tbl_income %>% filter(income == "Less than $15,000") %>% select(expected) %>% unlist()` vs. `r comparison_tbl_income %>% filter(income == "Less than $15,000") %>% select(actual) %>% unlist()`). In addition, there is a larger percentage of high-earning respondents than expected (`r comparison_tbl_income %>% filter(income == "Above $75,000") %>% select(expected) %>% unlist()` vs. `r comparison_tbl_income %>% filter(income == "Above $75,000") %>% select(actual) %>% unlist()`).
We use a proportions test to determine if the proportions are indeed different.
```{r}
income_matrix <-
  inner_join(income_expected, income_actual, by = c("income")) %>%
  select(expected_n, actual_n) %>%
  as.matrix()
(income_prop_test <- prop.test(income_matrix))
```
Using `r income_prop_test$method`, we have strong evidence against the null hypothesis that the proportions in the income groups are the same. This provides further evidence that the MTURK sample does not represent the US population.
### (2)
Though we might be concerned that our sample does not represent the MTURK population as a whole, we have no evidence of this from the data provided. There should be concern that someone who opts to participate in a survey about satellite radio might be more likely to be a satellite radio listener (unless MTURK workers are simply much more likely to be Sirius listeners). However, we have no evidence to support this claim.
In thinking about this question we read the articles [“Who are these people?” Evaluating the demographic characteristics and political preferences of MTurk survey respondents](http://scholar.harvard.edu/dtingley/files/whoarethesepeople.pdf) and ["Demographics of Mechanical Turk"](https://www.researchgate.net/publication/228140347_Demographics_of_Mechanical_Turk).
### (3)
In order to estimate the number of Wharton listeners in the US we will create a 95% confidence interval of the proportion of Wharton listeners in the MTURK dataset and multiply this by the Sirius radio listeners (51.6 million).
$$\hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
```{r}
p_hat <-
  survey_results_final %>%
  filter(sirius == "Yes") %>%
  summarise(p_hat = sum(wharton == "Yes") / n()) %>%
  unlist()
# The n in the standard error is the number of Sirius listeners, since p_hat
# is computed among Sirius listeners only
n_sirius <- sum(survey_results_final$sirius == "Yes")
ci2 <- c(p_hat - qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n_sirius)
         , p_hat + qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n_sirius))
pop_p <- p_hat * 51.6
pop_ci <- round(ci2 * 51.6, 2)
```
We estimate the sample proportion to be **`r p_hat`** and the 95% confidence interval to be
$$(`r ci2[1]`, `r ci2[2]`)$$
Thus we estimate the size of the Wharton listeners in the US to be **`r round(pop_p,2)`** million and the 95% confidence interval to be (in millions)
$$(`r pop_ci[1]`, `r pop_ci[2]`)$$
## Brief Report
We have reviewed the responses of `r nrow(survey_results_cleaning)` MTURK survey respondents. We have evidence that the survey respondents do not represent the population of the US based on the proportion of Sirius listeners (`r round(51.6/321.4, 4)` vs `r round(sirius_prop, 4)`) and on age, gender, income, and education characteristics. However, assuming that the sample represents the population, we estimate that there are between `r pop_ci[1]` and `r pop_ci[2]` million listeners of "Business Radio Powered by the Wharton School".
# Question 3
## Part A
```{r}
x <- seq(0, 1, length = 40)
y <- 1 + 1.2*x + rnorm(40, mean = 0, sd = 2)
ggplot(data_frame(x, y), aes(x = x, y = y)) + geom_point(colour = pal538['blue']) +
theme_jrf() +
scale_x_continuous(expand = c(0.05, 0.01)) + scale_y_continuous(expand = c(0.02, 0.01)) +
labs(title = "Scatterplot of (x,y) pairs", y = "y", x = "x")
```
We use the `lm` function to fit a linear model.
```{r results = 'asis'}
fit1 <- lm(y ~ x)
tidy(fit1) %>% pander()
```
We find that $\hat{\beta}_0=`r fit1$coefficients['(Intercept)']`$ and $\hat{\beta}_1=`r fit1$coefficients['x']`$. Next we overlay the LS line on the scatterplot.
```{r fig.align = 'center'}
ggplot(data = fit1$model, aes(x = x, y = y)) + geom_point(colour = pal538['blue']) +
geom_smooth(method="lm", se = TRUE, colour = pal538['red']) +
theme_jrf() +
scale_x_continuous(expand = c(0.05, 0.01)) + scale_y_continuous(expand = c(0.02, 0.01)) +
labs(title = "LS Equation", y = "y", x = "x")
```
The 95% confidence interval for $\beta_1$, using the critical value $t_{0.975,\,38} = `r round(qt(0.975, 38), 3)`$, is
$$`r fit1$coefficients['x']` \pm `r round(qt(0.975, 38), 3)`\times `r coef(summary(fit1))[, "Std. Error"]["x"]`$$
or
$$(`r fit1$coefficients['x'] - qt(0.975, 38)*coef(summary(fit1))[, "Std. Error"]["x"]`, `r fit1$coefficients['x'] + qt(0.975, 38)*coef(summary(fit1))[, "Std. Error"]["x"]`)$$
This 95% confidence interval does indeed contain the true $\beta_1$ which is $1.2$.
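As a cross-check, base R's `confint` produces the same t-based interval as the hand computation. A minimal sketch on freshly simulated data (the seed below is an assumption for illustration, not the one used above):

```{r}
set.seed(1)  # assumed seed, for illustration only
x <- seq(0, 1, length = 40)
y <- 1 + 1.2 * x + rnorm(40, mean = 0, sd = 2)
fit <- lm(y ~ x)

# confint() computes beta1_hat +/- qt(0.975, df = n - 2) * SE(beta1_hat)
ci <- confint(fit, "x", level = 0.95)
est <- coef(fit)["x"]
se <- summary(fit)$coefficients["x", "Std. Error"]
manual <- est + c(-1, 1) * qt(0.975, df = 38) * se
```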
The RSE is `r sigma(fit1)` which is very close to the true standard deviation of the error of $\sigma = 2$.
## Part B
We begin with the given simulation code chunk:
```{r }
x <- seq(0, 1, length = 40)
n_sim <- 100
b1 <- numeric(n_sim)       # n_sim many LS estimates of beta1 (=1.2)
upper_ci <- numeric(n_sim) # upper bound
lower_ci <- numeric(n_sim) # lower bound
t_star <- qt(0.975, 38)
# Carry out the simulation
for (i in 1:n_sim) {
  y <- 1 + 1.2 * x + rnorm(40, sd = 2)
  lse <- lm(y ~ x)
  lse_out <- summary(lse)$coefficients
  se <- lse_out[2, 2]
  b1[i] <- lse_out[2, 1]
  upper_ci[i] <- b1[i] + t_star * se
  lower_ci[i] <- b1[i] - t_star * se
}
```
We summarize the simulated LS estimates $b_1$ of $\beta_1$.
```{r results = 'asis'}
summary(b1) %>% pander()
```
```{r }
ggplot(data = data_frame(b1 = b1), aes(x = b1)) + geom_histogram(binwidth = 0.2, fill = pal538['blue']) +
geom_vline(xintercept = 1.2, colour = pal538['red']) +
geom_label(aes(x = 1.2, y = Inf, label = 'beta[1] == 1.2'),
vjust = "inward", hjust = "inward", parse = TRUE, colour = pal538['red']) +
theme_jrf() +
scale_x_continuous(expand = c(0.05, 0.01)) + scale_y_continuous(expand = c(0.02, 0.01)) +
labs(title = expression("Histogram of LS estimates "~b[1]~" of "~beta[1]), y = "Count", x = expression(b[1]))
```
The sampling distribution does agree with the theory, as most of the LS estimates of $\beta_1$ are close to the true value of 1.2.
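As a further check (a small sketch using the simulated `b1` vector and the known $\sigma = 2$), the empirical standard deviation of the estimates should be close to the theoretical standard error $\sigma/\sqrt{\sum_i (x_i - \bar{x})^2}$:
```{r}
# Theoretical SE of b1 under the true model vs the empirical SD of the
# simulated estimates
se_theory <- 2 / sqrt(sum((x - mean(x))^2))
c(theoretical = se_theory, empirical = sd(b1))
```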
```{r }
ci <- data_frame(n = 1:100, b1 = b1 , lower_ci = lower_ci, upper_ci = upper_ci,
covers = factor(ifelse(lower_ci < 1.2 & upper_ci > 1.2, "Yes", "No"), levels = c("Yes", "No")))
```
We find that `r sum(ci$covers == "Yes")` out of `r nrow(ci)` 95% confidence intervals cover the true $\beta_1$. We show this graphically below, where the red intervals do not cover the true $\beta_1$ and the green intervals do cover the true $\beta_1$.
```{r fig.align = 'center', fig.width=7, fig.height=7}
ggplot(data = ci) +
geom_vline(xintercept = 1.2) +
geom_segment(aes(x = lower_ci, xend = upper_ci, y = n, yend = n, colour = covers)) +
labs(title = "100 Sample Confidence Intervals", y = NULL, x = expression(beta[1])) +
geom_label(aes(x = 1.2, y = Inf, label = 'beta[1] == 1.2'), vjust = "inward", hjust = "inward", parse = TRUE) +
guides(color = guide_legend(title = expression("Covers "~beta[1]~"?"))) +
theme(legend.position = 'bottom') +
theme_jrf() +
scale_x_continuous(expand = c(0.05, 0.01)) + scale_y_continuous(expand = c(0.02, 0.01)) +
scale_colour_manual(values = c('Yes' = pal538['green'][[1]], 'No' = pal538['red'][[1]]))
```
# Question 4
## Summary
We begin by loading and tidying the ML Pay dataset.
```{r message = FALSE, warning = FALSE}
# Read in the ML Pay dataset
ml_pay <- read_csv(paste0(data_dir, "MLPayData_Total.csv"))
# Let's tidy the dataset
ml_pay2 <-
ml_pay %>%
rename(team = Team.name.2014) %>%
gather(metric_raw, value, -payroll, -avgwin, -team) %>%
mutate(
year = as.factor(str_extract(metric_raw, "(\\d)+"))
, metric = ifelse(substring(metric_raw,1,1) == "p", "payroll",
ifelse(str_detect(metric_raw, ".pct"), "avgwin", "wins"))
) %>%
select(team, year, metric, value, payroll, avgwin)
ml_pay2
```
Let's do a few data quality checks, where we ensure there are 30 teams per year and there are no missing values.
```{r}
# Show there are 30 unique teams per year
ml_pay2 %>%
group_by(year) %>%
summarise(
n = n()
, n_distinct = n_distinct(team)
)
# Show that there are no missing values
ml_pay2 %>%
summarise(
na = sum(is.na(value))
, nan = sum(is.nan(value))
)
```
For the 17 years between 1998 and 2014, we summarize the payroll of the 30 teams.
```{r}
ml_pay2 %>%
filter(metric == "payroll") %>%
group_by(year) %>%
summarise(
min = min(value)
, p25 = quantile(value, .25)
, p50 = quantile(value, .5)
, mean = mean(value)
, p75 = quantile(value, .75)
, max = max(value)
)
```
The box plot below shows a general growth in payroll spending over the 17 years in MLB. The outlier at the high end of the payroll scale is the New York Yankees.
```{r}
ml_pay2 %>%
filter(metric == "payroll") %>%
ggplot(aes(year, value)) + geom_boxplot(fill = pal538['blue']) +
theme_jrf() +
labs(title = "Payroll Growth", y = "Team Payroll ($m)", x = NULL)
```
Let's compute the average year-over-year (yoy) payroll growth by team.
```{r}
ml_pay2 %>%
filter(metric == "payroll") %>%
arrange(team, year) %>%
group_by(team) %>%
mutate(
yoy_growth = (value - lag(value)) / lag(value)
) %>%
group_by(team) %>%
summarise(
avg_yoy_growth = mean(yoy_growth, na.rm = TRUE)
) %>%
ungroup() %>%
arrange(desc(avg_yoy_growth)) %>%
print(n = 30)
```
Let's collapse this to a single overall average yoy growth figure.
```{r}
avg_yoy_growth <-
ml_pay2 %>%
filter(metric == "payroll") %>%
arrange(team, year) %>%
group_by(team) %>%
mutate(
yoy_growth = (value - lag(value)) / lag(value)
) %>%
group_by(team) %>%
summarise(
avg_yoy_growth = mean(yoy_growth, na.rm = TRUE)
) %>%
ungroup() %>%
summarise(
avg_yoy_growth = mean(avg_yoy_growth)
) %>%
unlist()
avg_yoy_growth
```
Next, we summarize the winning percentage for the 17 years. This is not particularly meaningful, but it is a way to identify any errant values.
```{r}
ml_pay2 %>%
filter(metric == "avgwin") %>%
group_by(year) %>%
summarise(
min = min(value)
, p25 = quantile(value, .25)
, p50 = quantile(value, .5)
, mean = mean(value)
, p75 = quantile(value, .75)
, max = max(value)
)
```
The box plot below shows the dispersion of winning percentage over time. The dot in 2003 is the Detroit Tigers, who lost more games than any American League team in history (43-119).
```{r}
ml_pay2 %>%
filter(metric == "avgwin") %>%
ggplot(aes(year, value)) + geom_boxplot(fill = pal538['blue']) +
scale_y_continuous(labels = scales::percent) +
theme_jrf() +
labs(title = "Winning Percentage", y = "Winning Percentage", x = NULL)
```
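The Tigers claim can be checked by locating the minimum winning percentage in the dataset (a quick sketch; it should return the 2003 Detroit Tigers row if the claim holds):
```{r}
# Lowest winning percentage across all team-years
ml_pay2 %>%
  filter(metric == "avgwin") %>%
  filter(value == min(value)) %>%
  select(team, year, value)
```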
Let's summarize the two variables across time to get an idea of where the values fall.
```{r results = 'asis'}
ml_pay2 %>%
filter(metric %in% c("payroll","avgwin")) %>%
group_by(metric) %>%
summarise(
min = min(value)
, p25 = quantile(value, .25)
, p50 = quantile(value, .5)
, mean = mean(value)
, p75 = quantile(value, .75)
, max = max(value)
) %>%
pander()
```
Next, let's look at scatter plots of payroll vs winning percentage over the 17 years. This plot helps highlight the fact that average payroll increases over time.
```{r fig.width=7, fig.height=6}
ml_pay2 %>%
filter(metric %in% c("payroll","avgwin")) %>%
select(-payroll, -avgwin) %>%
spread(metric, value) %>%
ggplot(aes(x = payroll, y = avgwin)) + facet_wrap(~ year) +
geom_point(colour = pal538['blue'], alpha = 0.75) +
geom_smooth(method = "lm", se = FALSE, colour = pal538['red']) +
scale_y_continuous(labels = scales::percent) +
labs(title = "Payroll vs Winning Percentage", y = "Winning Percentage", x = "Payroll ($m)") +
theme_jrf() +
geom_text(data =
. %>%
group_by(year) %>%
summarise(
cor = cor(payroll, avgwin)
),
aes(x = 170, y = .7, label = paste0("cor = ", round(cor, 3))),
size = 3
)
avg_person_cor <-
ml_pay2 %>%
filter(metric %in% c("payroll","avgwin")) %>%
select(-payroll, -avgwin) %>%
spread(metric, value) %>%
group_by(year) %>%
summarise(
cor = cor(payroll, avgwin)
) %>%
ungroup() %>%
summarise(
cor = mean(cor)
) %>%
unlist()
# Average Pearson correlation across years
avg_person_cor
```
Let's show the trends in the two variables by team.
```{r}
ml_pay2 %>%
filter(metric %in% c("payroll","avgwin")) %>%
ggplot(aes(x = year, y = value, group = team)) +
facet_grid(metric ~ ., scales = "free_y", switch = "y",
labeller = ggplot2::labeller(metric = c(avgwin = "Winning Percentage", payroll = "Payroll"))) +
geom_line(colour = pal538['blue']) +
guides(color = FALSE) +
theme_jrf() +
labs(title = "Payroll and Winning Percentage by Team", y = NULL, x = NULL)
```
Summary of Exploratory Analysis:
1. Payroll has generally been increasing - the yoy average growth is **`r round(avg_yoy_growth * 100,2)`%**.
2. There does appear to be a linear relationship between payroll and winning percentage within a given year. The average Pearson correlation coefficient is **`r round(avg_person_cor,3)`**.
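As a rough companion check (a sketch that pools all team-years, and therefore conflates the payroll growth over time noted in point 1), a pooled Pearson correlation test can be run on the same reshaped data:
```{r}
# Pooled correlation of payroll and winning percentage across all team-years
ml_pay2 %>%
  filter(metric %in% c("payroll", "avgwin")) %>%
  select(-payroll, -avgwin) %>%
  spread(metric, value) %>%
  with(cor.test(payroll, avgwin))
```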
## Prediction
Let's build a linear model to predict winning percentage for each of the 17 years. The best way to do this is using [nested data frames (tidyr), purrr, and broom](https://blog.rstudio.org/2016/02/02/tidyr-0-4-0/).
```{r}
lm_by_year <-
ml_pay2 %>%
filter(metric %in% c("payroll","avgwin")) %>%
select(-payroll, -avgwin) %>%
spread(metric, value) %>%
group_by(year) %>%
nest() %>%
mutate(
model = purrr::map(data, ~ lm(avgwin ~ payroll, data = .))
)
```
Below is a summary of each model. Some of the models are significant at the 95% confidence level, but a number are not, notably 2012, 2014, and 2000. Looking back at the Payroll vs Winning Percentage plot, the correlations for these years are lower than for the other years.
```{r}
lm_by_year %>%