# Data wrangling

Open RStudio.

Open a new R script in RStudio and **save it as** `wpa_4_LastFirst.R` (where Last and First are your last and first names).

Be careful about capitalization, the order of last and first name, and using `_` instead of `-`.

At the top of your script, write the following (**with appropriate changes**):

In [3]:
# Assignment: WPA 4
# Name: Laura Fontanesi
# Date: 23 March 2020

#### Load some data in R

In [4]:
library(tidyverse)

# Load data in R
survey_data = read_csv("https://raw.githubusercontent.com/laurafontanesi/r-seminar/master/data/ccam.csv")

glimpse(survey_data)

“Missing column names filled in: 'X1' [1]”

── Column specification ────────────────────────────────────────────────────────
cols(
  .default = col_character(),
  X1 = col_double(),
  case_ID = col_double(),
  year = col_double(),
  weight_wave = col_double(),
  weight_aggregate = col_double(),
  reg_coal_emissions = col_logical(),
  hear_GW_media = col_logical(),
  age = col_double(),
  religion_other_nonchristian = col_logical(),
  house_size = col_double(),
  house_ages0to1 = col_double(),
  house_ages2to5 = col_double(),
  house_ages6to12 = col_double(),
  house_ages13to17 = col_double(),
  house_ages18plus = col_double()
)

Rows: 20,024
Columns: 55
$ X1                          <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
$ case_ID                     <dbl> 2, 3, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 1…
$ wave                        <chr> "Nov 2008", "Nov 2008", "Nov 2008", "Nov 2…
$ year                        <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, …
$ weight_wave                 <dbl> 0.54, 0.85, 0.49, 0.29, 1.29, 2.56, 0.23, …
$ weight_aggregate            <dbl> 0.2939263, 0.4626617, 0.2667109, 0.1578493…
$ happening                   <chr> "Yes", "Don't know", "Don't know", "Yes", …
$ cause_original              <chr> "Caused mostly by human activities", "Caus…
$ cause_other_text            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ cause_recoded               <chr> "Caused mostly by human activ

## 1. What is data wrangling?

Data wrangling is the process of transforming data from the raw format in which they were collected into a format that is better suited to exploratory and confirmatory data analyses.

Until now, we have only worked with "tidy" datasets, i.e., datasets that were already prepared for plotting and analysis. But this is an exception. Especially when collecting your own data, but also when accessing another researcher's data, you typically have to go through a few steps before you can run analyses or make plots.

Today's class is based on specific functions in the tidyverse that serve exactly this purpose.

## 2. Functions to know and some examples

Virtually all you need to know is in [this cheatsheet](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf).

The functions you will use most frequently are:

- `%>%`: The pipe operator, to chain functions (not recommended at the beginning).

- `arrange`: order the dataframe based on values of variables.

- `rename`: to rename specific columns.

- `filter` and `slice`: to select subsets of rows.

- `select`: to select subsets of columns.

- `mutate`: make new columns based on modifications of existing ones.

- `mutate_all`: make the same modification on all columns. 

- `mutate_if`: make the same modification on columns satisfying specific conditions. 

Look [here](https://dplyr.tidyverse.org/reference/mutate_all.html) for other kinds of mutate.

- `summarize`: to calculate summary statistics.

- `summarize_all`: to calculate summary statistics on all columns. 

- `summarize_if`: to calculate summary statistics on columns satisfying specific conditions. 

Look [here](https://dplyr.tidyverse.org/reference/summarise_all.html) for other kinds of summarize.

- `group_by`: to group the data based on some variables, so that subsequent calculations are done separately within each group.

- `full_join`, `left_join`, `right_join`: to join separate dataframes.

- `bind_rows` and `bind_cols`: to append dataframes vertically or horizontally.

- `gather` and `spread`: bring wide form to long form and vice versa.

- `unite`: make a column from multiple columns.

- `separate`: make multiple columns from one column.
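
Some functions in this list (the joins, `bind_rows`, `gather` and `spread`) are not applied to the survey data in the cells that follow, so here is a minimal sketch of what they do. The tibbles `scores` and `groups` and all of their columns are made up for illustration and are not part of the survey dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, not part of the survey dataset
scores = tibble(id = c(1, 2, 3),
                pre = c(10, 12, 9),
                post = c(14, 11, 13))
groups = tibble(id = c(1, 2, 4),
                group = c("control", "treatment", "control"))

# left_join keeps every row of the first dataframe (id 3 gets group = NA)
left_joined = left_join(scores, groups, by = "id")

# full_join keeps every row of both dataframes (ids 1, 2, 3 and 4)
full_joined = full_join(scores, groups, by = "id")

# gather: wide form to long form (one row per id and time point)
long_scores = gather(scores, key = "time", value = "score", pre, post)

# spread: long form back to wide form
wide_scores = spread(long_scores, key = time, value = score)

# bind_rows: append dataframes vertically
stacked = bind_rows(scores, scores)
```

In recent versions of tidyr, `pivot_longer` and `pivot_wider` are the recommended successors of `gather` and `spread`; the older functions still work.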

In [5]:
# only select columns of interest (equivalent to data.frame[, interesting_columns])
interesting_columns = c('wave', 'year', 'happening', 'cause_recoded', 'sci_consensus', 'worry', 'harm_personally', 
                        'harm_US', 'harm_dev_countries', 'harm_future_gen', 'harm_plants_animals', 'when_harm_US',
                        'reg_CO2_pollutant', 'reg_utilities', 'fund_research', 'discuss_GW',
                        'gender', 'age_category', 'educ_category', 'income_category', 'race', 'party_x_ideo', 'region4', 'employment')

columns_to_drop = c('case_ID', 'weight_wave', 'weight_aggregate', 'cause_original', 'cause_other_text',
                    'reg_coal_emissions', 'hear_GW_media', 'age', 'generation', 'educ', 'income', 'ideology',
                    'party', 'party_w_leaners', 'registered_voter', 'region9', 'religion', 'religion_other_nonchristian',
                    'evangelical', 'service_attendance', 'marit_status', 'house_head', 'house_size',
                    'house_ages0to1', 'house_ages2to5', 'house_ages6to12', 'house_ages13to17', 'house_ages18plus',
                    'house_type', 'house_own')

new_survey_data = select(survey_data, all_of(interesting_columns))

# or drop the unwanted columns instead (this keeps X1, which is not in interesting_columns):
new_survey_data = select(survey_data, -all_of(columns_to_drop))

glimpse(new_survey_data)

Rows: 20,024
Columns: 25
$ X1                  <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
$ wave                <chr> "Nov 2008", "Nov 2008", "Nov 2008", "Nov 2008", "N…
$ year                <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 20…
$ happening           <chr> "Yes", "Don't know", "Don't know", "Yes", "Yes", "…
$ cause_recoded       <chr> "Caused mostly by human activities", "Caused mostl…
$ sci_consensus       <chr> "Most scientists think global warming is happening…
$ worry               <chr> "Somewhat worried", "Not very worried", "Not at al…
$ harm_personally     <chr> "Only a little", "Only a little", "Not at all", "O…
$ harm_US             <chr> "A moderate amount", "Refused", "Not at all", "Onl…
$ harm_dev_countries  <chr> "A great deal", "Only a little", "Not
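
Besides vectors of column names, `select` also accepts helper functions, and `rename` (listed above) changes column names without dropping anything. A small sketch on a made-up tibble; `toy` and its values are hypothetical, only the column names mimic the survey data.

```r
library(dplyr)

# Hypothetical toy data with a few survey-like column names
toy = tibble(X1 = 0:2,
             harm_US = c("A great deal", "Refused", "Not at all"),
             harm_personally = c("Only a little", "Only a little", "Not at all"),
             gender = c("Male", "Female", "Female"))

# select columns by a name prefix instead of listing them one by one
harm_only = select(toy, starts_with("harm_"))

# drop a single column with a minus sign
no_index = select(toy, -X1)

# rename uses the pattern new_name = old_name
renamed = rename(toy, row_index = X1)
```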

In [6]:
# sort by one variable, and only show the first 6 rows
head(arrange(survey_data, happening), 6)

X1,case_ID,wave,year,weight_wave,weight_aggregate,happening,cause_original,cause_other_text,cause_recoded,⋯,employment,house_head,house_size,house_ages0to1,house_ages2to5,house_ages6to12,house_ages13to17,house_ages18plus,house_type,house_own
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
1,3,Nov 2008,2008,0.85,0.4626617,Don't know,Caused mostly by human activities,,Caused mostly by human activities,⋯,Not working - disabled,Head of household,2,0,0,0,0,2,Mobile home,Rented
2,5,Nov 2008,2008,0.49,0.2667109,Don't know,Caused mostly by natural changes in the environment,,Caused mostly by natural changes in the environment,⋯,Not working - looking for work,Head of household,2,0,0,0,0,2,One-family house detached from any other house,Owned by you or someone in your household
5,8,Nov 2008,2008,2.56,1.3934283,Don't know,Caused mostly by natural changes in the environment,,Caused mostly by natural changes in the environment,⋯,Working - self-employed,Head of household,3,0,0,1,0,2,One-family house attached to one or more houses (such as a condo or townhouse),Owned by you or someone in your household
15,20,Nov 2008,2008,0.82,0.4463325,Don't know,Caused mostly by natural changes in the environment,,Caused mostly by natural changes in the environment,⋯,Not working - other,Head of household,3,0,1,0,0,2,One-family house detached from any other house,Owned by you or someone in your household
26,32,Nov 2008,2008,0.77,0.4191171,Don't know,Caused mostly by natural changes in the environment,,Caused mostly by natural changes in the environment,⋯,Not working - retired,Head of household,2,0,0,0,0,2,Mobile home,Owned by you or someone in your household
33,40,Nov 2008,2008,1.73,0.9416527,Don't know,Caused mostly by natural changes in the environment,,Caused mostly by natural changes in the environment,⋯,Working - as a paid employee,Head of household,4,0,0,0,0,4,One-family house detached from any other house,Owned by you or someone in your household


In [7]:
# sort by 2 variables (the second one in descending order), and only show the last 10 rows
tail(arrange(survey_data, happening, desc(cause_recoded)), 10)

X1,case_ID,wave,year,weight_wave,weight_aggregate,happening,cause_original,cause_other_text,cause_recoded,⋯,employment,house_head,house_size,house_ages0to1,house_ages2to5,house_ages6to12,house_ages13to17,house_ages18plus,house_type,house_own
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
19879,35283,Oct 2017,2017,0.8294,0.7491838,Yes,Other (Please specify),A combination of both...,Caused by human activities and natural changes,⋯,Not working - retired,Head of household,3,0,0,0,0,3,One-family house detached from any other house,Rented
19880,35285,Oct 2017,2017,2.5982,2.3469125,Yes,Other (Please specify),caused by both human and natural activities,Caused by human activities and natural changes,⋯,Not working - looking for work,Head of household,4,0,0,1,1,2,One-family house attached to one or more houses (such as a condo or townhouse),Rented
19924,35332,Oct 2017,2017,1.0176,0.9191818,Yes,Other (Please specify),I think it's a combo of human and natural,Caused by human activities and natural changes,⋯,Working - self-employed,Head of household,2,0,0,0,0,2,One-family house detached from any other house,Owned by you or someone in your household
19925,35334,Oct 2017,2017,0.9679,0.8742886,Yes,Other (Please specify),Combination of human and nature,Caused by human activities and natural changes,⋯,Working - as a paid employee,Head of household,3,1,0,0,0,2,One-family house attached to one or more houses (such as a condo or townhouse),Owned by you or someone in your household
19928,35337,Oct 2017,2017,0.6358,0.574308,Yes,Other (Please specify),both human activity and nature,Caused by human activities and natural changes,⋯,Working - as a paid employee,Head of household,1,0,0,0,0,1,Building with 2 or more apartments,Owned by you or someone in your household
19950,35365,Oct 2017,2017,1.2562,1.1347054,Yes,Other (Please specify),A little bit of both,Caused by human activities and natural changes,⋯,Working - as a paid employee,Head of household,3,0,0,0,0,3,One-family house detached from any other house,Rented
19960,35377,Oct 2017,2017,0.5589,0.5048454,Yes,Other (Please specify),"i understand iut to be a natyral , recurring phenomenon that currently is being exacerbated by human activity",Caused by human activities and natural changes,⋯,Working - as a paid employee,Head of household,2,0,0,0,0,2,Building with 2 or more apartments,Rented
19974,35396,Oct 2017,2017,0.8183,0.7391573,Yes,Other (Please specify),a combination of both humand and natural changes,Caused by human activities and natural changes,⋯,Working - as a paid employee,Head of household,6,0,0,0,0,6,One-family house detached from any other house,Owned by you or someone in your household
20008,35436,Oct 2017,2017,0.8983,0.81142,Yes,Other (Please specify),Both human activities and natural causes,Caused by human activities and natural changes,⋯,Working - as a paid employee,Head of household,5,0,0,0,2,3,One-family house detached from any other house,Owned by you or someone in your household
20020,35451,Oct 2017,2017,0.9108,0.8227111,Yes,Other (Please specify),combination of natural and human activities,Caused by human activities and natural changes,⋯,Working - as a paid employee,Head of household,3,0,0,0,0,3,One-family house attached to one or more houses (such as a condo or townhouse),Rented


In [8]:
# another way to select rows (equivalent to data.frame[200:210,])
slice(survey_data, 200:210)

X1,case_ID,wave,year,weight_wave,weight_aggregate,happening,cause_original,cause_other_text,cause_recoded,⋯,employment,house_head,house_size,house_ages0to1,house_ages2to5,house_ages6to12,house_ages13to17,house_ages18plus,house_type,house_own
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
199,217,Nov 2008,2008,0.78,0.4245602,Yes,Other (Please specify),human and natural occurences,Caused by human activities and natural changes,⋯,Working - as a paid employee,Not head of household,4,0,0,0,0,4,One-family house detached from any other house,Owned by you or someone in your household
200,218,Nov 2008,2008,0.42,0.2286093,Don't know,Caused mostly by natural changes in the environment,,Caused mostly by natural changes in the environment,⋯,Working - as a paid employee,Head of household,3,0,0,0,0,3,One-family house detached from any other house,Rented
201,219,Nov 2008,2008,0.67,0.3646863,No,Caused mostly by natural changes in the environment,,Caused mostly by natural changes in the environment,⋯,Not working - disabled,Not head of household,1,0,0,0,0,1,One-family house detached from any other house,Rented
202,220,Nov 2008,2008,0.87,0.4735479,Don't know,Caused mostly by natural changes in the environment,,Caused mostly by natural changes in the environment,⋯,Not working - looking for work,Head of household,4,0,0,2,0,2,One-family house detached from any other house,Owned by you or someone in your household
203,221,Nov 2008,2008,0.93,0.5062064,No,Caused mostly by natural changes in the environment,,Caused mostly by natural changes in the environment,⋯,Working - as a paid employee,Head of household,2,0,0,0,0,2,One-family house detached from any other house,Owned by you or someone in your household
204,222,Nov 2008,2008,0.77,0.4191171,Yes,Caused mostly by human activities,,Caused mostly by human activities,⋯,Not working - looking for work,Head of household,6,0,0,1,1,4,One-family house detached from any other house,Owned by you or someone in your household
205,223,Nov 2008,2008,2.09,1.1376036,Yes,Caused mostly by human activities,,Caused mostly by human activities,⋯,Not working - other,Head of household,4,0,1,1,0,2,One-family house detached from any other house,Owned by you or someone in your household
206,224,Nov 2008,2008,0.79,0.4300033,Yes,Other (Please specify),caused by a combination of both humans and nature,Caused by human activities and natural changes,⋯,Working - as a paid employee,Head of household,4,0,0,0,0,4,One-family house detached from any other house,Owned by you or someone in your household
207,225,Nov 2008,2008,2.35,1.2791236,Yes,Caused mostly by human activities,,Caused mostly by human activities,⋯,Not working - other,Head of household,1,0,0,0,0,1,One-family house attached to one or more houses (such as a condo or townhouse),Owned by you or someone in your household
208,226,Nov 2008,2008,0.59,0.3211417,Yes,Caused mostly by human activities,,Caused mostly by human activities,⋯,Not working - retired,Head of household,1,0,0,0,0,1,One-family house detached from any other house,Owned by you or someone in your household


In [9]:
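# eliminate people (i.e., rows) who didn't reply in some variables of interest (an absence of reply was coded as -1 here)
# note: the glimpse above shows these columns as text (<chr>) in this file, so `> 0` is a string comparison here, not a test on a numeric code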
survey_data = filter(survey_data, 
                     happening > 0, 
                     cause_recoded > 0, 
                     sci_consensus > 0, 
                     worry > 0)

glimpse(survey_data)

Rows: 18,694
Columns: 55
$ X1                          <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
$ case_ID                     <dbl> 2, 3, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 1…
$ wave                        <chr> "Nov 2008", "Nov 2008", "Nov 2008", "Nov 2…
$ year                        <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, …
$ weight_wave                 <dbl> 0.54, 0.85, 0.49, 0.29, 1.29, 2.56, 0.23, …
$ weight_aggregate            <dbl> 0.2939263, 0.4626617, 0.2667109, 0.1578493…
$ happening                   <chr> "Yes", "Don't know", "Don't know", "Yes", …
$ cause_original              <chr> "Caused mostly by human activities", "Caus…
$ cause_other_text            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ cause_recoded               <chr> "Caused mostly by human activ
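
When answers are stored as text labels, as the `<chr>` columns in the output above are, non-responses can be filtered by label rather than by numeric code. The sketch below is an illustration under assumptions: the tibble `toy` is made up, and treating `NA` and the label "Refused" as non-responses is a guess that should be checked against the actual labels (for example with `count`) before using it on the real data.

```r
library(dplyr)

# Hypothetical toy data with text-coded answers
toy = tibble(happening = c("Yes", "No", "Refused", NA, "Don't know"),
             worry = c("Very worried", "Refused", "Not at all worried",
                       "Somewhat worried", NA))

# list the labels that occur in a column, with their frequencies
count(toy, happening)

# labels treated as "no reply" (an assumption for this sketch)
non_response = c("Refused")

# keep only rows with a usable answer in both columns
answered = filter(toy,
                  !is.na(happening), !happening %in% non_response,
                  !is.na(worry), !worry %in% non_response)
```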

In [10]:
# convert all columns to integer type (text that cannot be read as a number becomes NA, with a warning)
survey_data = mutate_all(survey_data, 
                         as.integer)

glimpse(survey_data)

“NAs introduced by coercion”

Rows: 18,694
Columns: 55
$ X1                          <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
$ case_ID                     <int> 2, 3, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 1…
$ wave                        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ year                        <int> 2008, 2008, 2008, 2008, 2008, 2008, 2008, …
$ weight_wave                 <int> 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 1, 0, 0, …
$ weight_aggregate            <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ happening                   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ cause_original              <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ cause_other_text            <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ cause_recoded               <int> NA, NA, NA, NA, NA, NA, NA, N
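
As the output above shows, `as.integer` turns text labels into `NA`. One possible way to obtain integer codes from labelled answers is to go through a factor with an explicit level order. This is only a sketch on made-up data: the tibble `toy` is hypothetical, and the level order is an assumption chosen to mirror a 1 = no, 2 = don't know, 3 = yes coding. It also shows `across()`, which in dplyr 1.0.0 and later is the recommended replacement for `mutate_all` and `mutate_if`.

```r
library(dplyr)

# Hypothetical toy data with text-coded answers and years stored as text
toy = tibble(happening = c("Yes", "Don't know", "No", "Yes"),
             year = c("2008", "2008", "2010", "2010"))

# assumed order of the answer labels (1 = No, 2 = Don't know, 3 = Yes)
happening_levels = c("No", "Don't know", "Yes")

toy_codes = mutate(toy,
                   # labels -> factor with known level order -> integer codes
                   happening = as.integer(factor(happening, levels = happening_levels)),
                   # digits stored as text convert without producing NA
                   year = as.integer(year))

# apply one function to every column that satisfies a condition
toy_upper = mutate(toy, across(where(is.character), toupper))
```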

In [11]:
# add new columns
# note that when the name on the left-hand side already exists, that column is overwritten instead of a new one being added

survey_data = mutate(survey_data, 
                     happening_cont = happening + rnorm(mean=0, sd=.5, n=nrow(survey_data)),
                     worry_cont = worry + rnorm(mean=0, sd=.5, n=nrow(survey_data)),
                     year=recode(year, 
                                 `1` = 2008, 
                                 `2` = 2010,
                                 `3` = 2011,
                                 `4` = 2012, 
                                 `5` = 2013,
                                 `6` = 2014,
                                 `7` = 2015,
                                 `8` = 2016,
                                 `9` = 2017),
                     happening_labels=recode(happening,
                                             `1` = "no",
                                             `2` = "dont know",
                                             `3` = "yes"),
                     cause_recoded=recode(cause_recoded,
                                          `1` = "dont know",
                                          `2` = "other",
                                          `3` = "not happening",
                                          `4` = "natural",
                                          `5` = "human",
                                          `6` = "natural and human"),
                     sci_consensus=recode(sci_consensus,
                                          `1` = "dont know",
                                          `2` = "disagreement",
                                          `3` = "not happening",
                                          `4` = "happening"),
                     gender=recode(gender,
                                   `1` = "male",
                                   `2` = "female"),
                     age_category_labels=recode(age_category,
                                                `1` = "18-34",
                                                `2` = "35-54",
                                                `3` = "55+"),
                     educ_category_labels=recode(educ_category,
                                                 `1` = "no highschool",
                                                 `2` = "highschool",
                                                 `3` = "college",
                                                 `4` = "bachelor or higher"),
                     income_category_labels=recode(income_category,
                                                   `1` = "less 50000",
                                                   `2` = "50000-99999",
                                                   `3` = "more 100000"),
                     race=recode(race,
                                 `1` = 'white non hisp',
                                 `2` = 'black non hisp',
                                 `3` = 'other non hisp',
                                 `4` = 'hisp'),
                     party_x_ideo=recode(party_x_ideo,
                                         `-2` = "no interest",
                                         `-1` = "refused",
                                         `1` = "liberal democrat",
                                         `2` = "moderate democrat",
                                         `3` = "independent",
                                         `4` = "moderate republican",
                                         `5` = "conservative republican"),
                     region4 = recode(region4,
                                      `1` = "northeast",
                                      `2` = "midwest",
                                      `3` = "south",
                                      `4` = "west"),
                     employment = recode(employment,
                                         `1` = "Working/as a paid employee",
                                         `2` = "Working/selfemployed",
                                         `3` = "Not working/temporary",
                                         `4` = "Not working/looking",
                                         `5` = "Not working/retired",
                                         `6` = "Not working/disabled",
                                         `7` = "Not working/other"))

glimpse(survey_data)

“Unreplaced values treated as NA as .x is not compatible. Please specify replacements exhaustively or supply .default”


Rows: 18,694
Columns: 61
$ X1                          <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
$ case_ID                     <int> 2, 3, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 1…
$ wave                        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ year                        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ weight_wave                 <int> 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 1, 0, 0, …
$ weight_aggregate            <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ happening                   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ cause_original              <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ cause_other_text            <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ cause_recoded               <chr> NA, NA, NA, NA, NA, NA, NA, N

In [17]:
# get summary statistics
summarise(survey_data, 
          count=n(), 
          min_worry=min(worry_cont),
          quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
          mean_worry=mean(worry_cont), 
          quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
          max_worry=max(worry_cont),
          min_happening=min(happening_cont),
          quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
          mean_happening=mean(happening_cont),
          quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
          max_happening=max(happening_cont))

count,min_worry,quantile25_worry,mean_worry,quantile75_worry,max_worry,min_happening,quantile25_happening,mean_happening,quantile75_happening,max_happening
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
18694,,,,,,,,,,
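
Every statistic in the table above is empty (`NA`), apparently because `worry_cont` and `happening_cont` were computed from columns that contain only `NA` after the integer conversion. More generally, `min`, `mean` and `max` return `NA` as soon as a single value is missing, unless `na.rm = TRUE` is passed as it is for `quantile` in the call above. A small sketch on made-up numbers (the tibble `toy` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy = tibble(worry_cont = c(1.2, NA, 3.4, 2.0))

toy_summary = summarise(toy,
                        count = n(),
                        n_missing = sum(is.na(worry_cont)),
                        # without na.rm, one missing value makes the result NA
                        mean_default = mean(worry_cont),
                        # with na.rm = TRUE, missing values are ignored
                        mean_worry = mean(worry_cont, na.rm = TRUE),
                        min_worry = min(worry_cont, na.rm = TRUE),
                        max_worry = max(worry_cont, na.rm = TRUE))
```

Counting the missing values first (`n_missing`) is a cheap way to notice that a column is mostly or entirely `NA` before interpreting any summary of it.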


In [19]:
# get summary statistics per group
grouped_data = group_by(survey_data, year)

summarise(grouped_data, 
          count=n(), 
          min_worry=min(worry_cont),
          quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
          mean_worry=mean(worry_cont), 
          quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
          max_worry=max(worry_cont),
          min_happening=min(happening_cont),
          quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
          mean_happening=mean(happening_cont),
          quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
          max_happening=max(happening_cont))

year,count,min_worry,quantile25_worry,mean_worry,quantile75_worry,max_worry,min_happening,quantile25_happening,mean_happening,quantile75_happening,max_happening
<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,18694,,,,,,,,,,
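
The `summarize_all` and `summarize_if` variants from the function list combine with `group_by` in the same way. A minimal sketch on made-up data, written with `across()`, the newer spelling available in dplyr 1.0.0 and later; the tibble `toy` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two years, one text column, two numeric columns
toy = tibble(year = c(2008, 2008, 2010, 2010),
             wave = c("a", "a", "b", "b"),
             worry_cont = c(1, 3, 2, 4),
             happening_cont = c(2, 2, 3, 1))

grouped_toy = group_by(toy, year)

# mean of every numeric non-grouping column, computed per year
# (older spelling: summarise_if(grouped_toy, is.numeric, mean))
means_by_year = summarise(grouped_toy, across(where(is.numeric), mean))
```

The text column `wave` is left out of the result because it does not satisfy `is.numeric`, and the grouping column `year` is kept as the row identifier.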


In [20]:
# split columns
survey_data = separate(survey_data,
                       col = employment,
                       into = c('working', 'working_type'),
                       sep = '/',
                       remove = FALSE)
head(survey_data)

X1,case_ID,wave,year,weight_wave,weight_aggregate,happening,cause_original,cause_other_text,cause_recoded,⋯,house_ages13to17,house_ages18plus,house_type,house_own,happening_cont,worry_cont,happening_labels,age_category_labels,educ_category_labels,income_category_labels
<int>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<chr>,⋯,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>
0,2,,,0,0,,,,,⋯,0,3,,,,,,,,
1,3,,,0,0,,,,,⋯,0,2,,,,,,,,
2,5,,,0,0,,,,,⋯,0,2,,,,,,,,
3,6,,,0,0,,,,,⋯,0,2,,,,,,,,
4,7,,,1,0,,,,,⋯,0,2,,,,,,,,
5,8,,,2,1,,,,,⋯,0,2,,,,,,,,
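
To see `separate` produce non-empty columns, here is a sketch on a few made-up rows. The labels imitate the raw `employment` values printed in the tables earlier (for example "Not working - retired"), so the separator used here is `" - "` rather than `'/'`; the tibble `toy` itself is hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data imitating the raw employment labels
toy = tibble(employment = c("Working - as a paid employee",
                            "Not working - retired",
                            "Not working - looking for work"))

# split one column into two at the separator, keeping the original column
toy_split = separate(toy,
                     col = employment,
                     into = c("working", "working_type"),
                     sep = " - ",
                     remove = FALSE)
```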


In [21]:
# you can also create a new column via mutate + case_when
survey_data = mutate(survey_data,
                     working_recoded = case_when(working == "Not working" & working_type == 'looking' ~ 0,
                                                 working == "Not working" & working_type != 'looking' ~ 1,
                                                 working == "Working" ~ 2))
                     
head(survey_data)

X1,case_ID,wave,year,weight_wave,weight_aggregate,happening,cause_original,cause_other_text,cause_recoded,⋯,house_ages18plus,house_type,house_own,happening_cont,worry_cont,happening_labels,age_category_labels,educ_category_labels,income_category_labels,working_recoded
<int>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<chr>,⋯,<int>,<int>,<int>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>
0,2,,,0,0,,,,,⋯,3,,,,,,,,,
1,3,,,0,0,,,,,⋯,2,,,,,,,,,
2,5,,,0,0,,,,,⋯,2,,,,,,,,,
3,6,,,0,0,,,,,⋯,2,,,,,,,,,
4,7,,,1,0,,,,,⋯,2,,,,,,,,,
5,8,,,2,1,,,,,⋯,2,,,,,,,,,
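
Rows that match none of the conditions in `case_when` get `NA`, which is what happens to every row in the output above. The sketch below repeats the same conditions on made-up data and adds a final `TRUE ~ ...` branch as a fallback; the tibble `toy` and the fallback value `-1` are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data, including one row with missing values
toy = tibble(working = c("Working", "Not working", "Not working", NA),
             working_type = c("as a paid employee", "looking", "retired", NA))

toy_recoded = mutate(toy,
                     working_recoded = case_when(
                         working == "Not working" & working_type == "looking" ~ 0,
                         working == "Not working" & working_type != "looking" ~ 1,
                         working == "Working" ~ 2,
                         # fallback for rows that match none of the conditions above
                         TRUE ~ -1))
```

Conditions are checked in order and the first match wins, so the fallback branch has to come last.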


In [23]:
# get summary statistics per group
grouped_data = group_by(survey_data, working)

summarise(grouped_data, 
          count=n(), 
          min_worry=min(worry_cont),
          quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
          mean_worry=mean(worry_cont), 
          quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
          max_worry=max(worry_cont),
          min_happening=min(happening_cont),
          quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
          mean_happening=mean(happening_cont),
          quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
          max_happening=max(happening_cont))

working,count,min_worry,quantile25_worry,mean_worry,quantile75_worry,max_worry,min_happening,quantile25_happening,mean_happening,quantile75_happening,max_happening
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,18694,,,,,,,,,,


In [24]:
# first filter, and then get summary statistics per group
not_working_data = filter(survey_data, working == "Not working")

grouped_data = group_by(not_working_data, working_type)

summarise(grouped_data, 
          count=n(), 
          min_worry=min(worry_cont),
          quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
          mean_worry=mean(worry_cont), 
          quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
          max_worry=max(worry_cont),
          min_happening=min(happening_cont),
          quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
          mean_happening=mean(happening_cont),
          quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
          max_happening=max(happening_cont))

“no non-missing arguments to min; returning Inf”
“no non-missing arguments to max; returning -Inf”
“no non-missing arguments to min; returning Inf”
“no non-missing arguments to max; returning -Inf”


working_type,count,min_worry,quantile25_worry,mean_worry,quantile75_worry,max_worry,min_happening,quantile25_happening,mean_happening,quantile75_happening,max_happening
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>


In [25]:
# merge two columns
survey_data = unite(survey_data,
                    "race_gender",
                    race, 
                    gender,
                    sep = "_",
                    remove=FALSE)

head(select(survey_data, race, gender, race_gender))

race,gender,race_gender
<chr>,<chr>,<chr>
,,NA_NA
,,NA_NA
,,NA_NA
,,NA_NA
,,NA_NA
,,NA_NA
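
The `NA_NA` values above appear because `unite` pastes missing values as the text "NA" by default. A short sketch on made-up data, assuming a tidyr version (1.0.0 or later) in which `unite` has the `na.rm` argument; the tibble `toy` is hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing gender
toy = tibble(race = c("hisp", "hisp"),
             gender = c("female", NA))

# default behaviour: the missing value becomes the text "NA"
default_unite = unite(toy, "race_gender", race, gender,
                      sep = "_", remove = FALSE)

# na.rm = TRUE drops missing values before pasting
narm_unite = unite(toy, "race_gender", race, gender,
                   sep = "_", remove = FALSE, na.rm = TRUE)
```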


In [26]:
# get summary statistics per group
grouped_data = group_by(survey_data, race_gender)

summarise(grouped_data, 
          count=n(), 
          min_worry=min(worry_cont),
          quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
          mean_worry=mean(worry_cont), 
          quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
          max_worry=max(worry_cont),
          min_happening=min(happening_cont),
          quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
          mean_happening=mean(happening_cont),
          quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
          max_happening=max(happening_cont))

race_gender,count,min_worry,quantile25_worry,mean_worry,quantile75_worry,max_worry,min_happening,quantile25_happening,mean_happening,quantile75_happening,max_happening
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
NA_NA,18694,,,,,,,,,,


In [27]:
# similar results with 2 grouping variables in this case:

# get summary statistics per group
grouped_data = group_by(survey_data, race, gender)

summarise(grouped_data, 
          count=n(), 
          min_worry=min(worry_cont),
          quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
          mean_worry=mean(worry_cont), 
          quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
          max_worry=max(worry_cont),
          min_happening=min(happening_cont),
          quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
          mean_happening=mean(happening_cont),
          quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
          max_happening=max(happening_cont))

`summarise()` has grouped output by 'race'. You can override using the `.groups` argument.



race,gender,count,min_worry,quantile25_worry,mean_worry,quantile75_worry,max_worry,min_happening,quantile25_happening,mean_happening,quantile75_happening,max_happening
<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,18694,,,,,,,,,,
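
The message above says that the result is still grouped by `race`. A minimal sketch on hypothetical toy data of how the `.groups` argument of `summarise` can be used to return an ungrouped result instead:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data
toy = tibble(race = c("a", "a", "b", "b"),
             gender = c("f", "m", "f", "f"),
             worry = c(1, 2, 3, 4))

# .groups = "drop" removes all grouping from the output
by_both = toy %>%
    group_by(race, gender) %>%
    summarise(count = n(),
              mean_worry = mean(worry),
              .groups = "drop")

by_both
```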


### Do everything again but with pipes:

In [28]:
survey_data = read_csv("https://raw.githubusercontent.com/laurafontanesi/r-seminar/master/data/ccam.csv")

survey_data = survey_data %>%
    select(all_of(interesting_columns)) %>%
    filter(happening > 0, 
           cause_recoded > 0, 
           sci_consensus > 0, 
           worry > 0)  %>%
    mutate_all(as.integer) %>%
    mutate(happening_cont = happening + rnorm(mean=0, sd=.5, n=n()),
           worry_cont = worry + rnorm(mean=0, sd=.5, n=n()),
           year=recode(year, 
                       `1` = 2008, 
                       `2` = 2010,
                       `3` = 2011,
                       `4` = 2012, 
                       `5` = 2013,
                       `6` = 2014,
                       `7` = 2015,
                       `8` = 2016,
                       `9` = 2017),
           happening_labels=recode(happening,
                                   `1` = "no",
                                   `2` = "dont know",
                                   `3` = "yes"),
           cause_recoded=recode(cause_recoded,
                                `1` = "dont know",
                                `2` = "other",
                                `3` = "not happening",
                                `4` = "natural",
                                `5` = "human",
                                `6` = "natural and human"),
           sci_consensus=recode(sci_consensus,
                                `1` = "dont know",
                                `2` = "disagreement",
                                `3` = "not happening",
                                `4` = "happening"),
           gender=recode(gender,
                         `1` = "male",
                         `2` = "female"),
           age_category_labels=recode(age_category,
                                      `1` = "18-34",
                                      `2` = "35-54",
                                      `3` = "55+"),
           educ_category_labels=recode(educ_category,
                                       `1` = "no highschool",
                                       `2` = "highschool",
                                       `3` = "college",
                                       `4` = "bachelor or higher"),
           income_category_labels=recode(income_category,
                        `1` = "less 50000",
                        `2` = "50000-99999",
                        `3` = "more 100000"),
           race=recode(race,
                      `1` = 'white non hisp',
                      `2` = 'black non hisp',
                      `3` = 'other non hisp',
                      `4` = 'hisp'),
           party_x_ideo=recode(party_x_ideo,
                               `-2` = "no interest",
                               `-1` = "refused",
                               `1` = "liberal democrat",
                               `2` = "moderate democrat",
                               `3` = "independent",
                               `4` = "moderate republican",
                               `5` = "conservative republican"),
           region4 = recode(region4,
                            `1` = "northeast",
                            `2` = "midwest",
                            `3` = "south",
                            `4` = "west"),
           employment = recode(employment,
                               `1` = "Working/as a paid employee",
                               `2` = "Working/selfemployed",
                               `3` = "Not working/temporary",
                               `4` = "Not working/looking",
                               `5` = "Not working/retired",
                               `6` = "Not working/disabled",
                               `7` = "Not working/other")) %>%
    separate(col = employment,
             into = c('working', 'working_type'),
             sep = '/') %>%
    glimpse()  %>%
    filter(working == "Not working")  %>%
    group_by(working_type) %>%
    summarise(count=n(), 
              min_worry=min(worry_cont),
              quantile25_worry=quantile(worry_cont, .25),
              mean_worry=mean(worry_cont), 
              quantile75_worry=quantile(worry_cont, .75),
              max_worry=max(worry_cont),
              min_happening=min(happening_cont),
              quantile25_happening=quantile(happening_cont, .25),
              mean_happening=mean(happening_cont),
              quantile75_happening=quantile(happening_cont, .75),
              max_happening=max(happening_cont))


[36m──[39m [1m[1mColumn specification[1m[22m [36m─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────[39m
cols(
  .default = col_character(),
  case_ID = col_double(),
  wave = col_double(),
  weight_wave = col_double(),
  weight_aggregate = col_double(),
  happening = col_double(),
  discuss_GW = col_logical(),
  gender = col_logical(),
  age_category = col_double(),
  evangelical = col_logical(),
  house_ages0to1 = col_double(),
  house_ages2to5 = col_double(),
  house_ages6to12 = col_double(),
  house_ages13to17 = col_double(),
  house_ages18plus = col_double(),
  house_type = col_double()
)
ℹ Use `spec()` for the full colu

Rows: 1,665
Columns: 31
$ wave                   <int> 19, 24, 43, 53, 74, 81, 88, 90, 95, 107, 139, 1…
$ year                   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ happening              <int> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ cause_recoded          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ sci_consensus          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ worry                  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ harm_personally        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ harm_US                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ harm_dev_countries     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ harm_future_gen        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA,

“no non-missing arguments to min; returning Inf”
“no non-missing arguments to max; returning -Inf”
“no non-missing arguments to min; returning Inf”
“no non-missing arguments to max; returning -Inf”


In [29]:
survey_data

working_type,count,min_worry,quantile25_worry,mean_worry,quantile75_worry,max_worry,min_happening,quantile25_happening,mean_happening,quantile75_happening,max_happening
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
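
The pipe passes the result on its left as the first argument of the function on its right, so a piped chain is equivalent to the nested calls read inside-out. A minimal sketch on hypothetical toy data:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data
toy = tibble(group = c("a", "a", "b"),
             value = c(1, 3, 10))

# Nested calls: read from the inside out
nested = summarise(group_by(filter(toy, value > 0), group),
                   mean_value = mean(value))

# Same steps with pipes: read from top to bottom
piped = toy %>%
    filter(value > 0) %>%
    group_by(group) %>%
    summarise(mean_value = mean(value))

piped
```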


### Wide and long format

In [30]:
# Create fake data:
N = 15
mean_score_students = runif(N, 5, 10)
fake_data_wide = tibble( # same as data.frame but in tidyverse!
    student = 1:N,
    age = round(rnorm(N, 30, 5)),
    score_wpa1 = mean_score_students,
    score_wpa2 = mean_score_students*0.9 + rnorm(N, 0, 0.1),
    score_wpa3 = mean_score_students*0.5 + rnorm(N, 0, 1),
    score_wpa4 = mean_score_students*0.7 + rnorm(N, 0, 0.2)
)

fake_data_wide

student,age,score_wpa1,score_wpa2,score_wpa3,score_wpa4
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,24,5.350302,4.82346,1.348207,4.074634
2,32,9.289347,8.239027,4.434117,6.508575
3,28,6.379613,5.770534,4.299794,4.183542
4,30,6.076898,5.352874,2.835648,4.067586
5,32,7.225936,6.433638,4.978435,5.196511
6,23,5.101863,4.591889,1.562291,3.84082
7,19,9.666794,8.728218,4.471286,6.60966
8,31,6.18785,5.694937,2.814672,4.148429
9,32,6.475096,5.909518,1.925187,4.264363
10,24,5.522121,5.05283,2.767325,4.056865


In [31]:
# transform from wide to long format
fake_data_long = gather(fake_data_wide,
                        key='wpa',
                        value='score',
                        score_wpa1:score_wpa4)
head(fake_data_long, 10)

student,age,wpa,score
<int>,<dbl>,<chr>,<dbl>
1,24,score_wpa1,5.350302
2,32,score_wpa1,9.289347
3,28,score_wpa1,6.379613
4,30,score_wpa1,6.076898
5,32,score_wpa1,7.225936
6,23,score_wpa1,5.101863
7,19,score_wpa1,9.666794
8,31,score_wpa1,6.18785
9,32,score_wpa1,6.475096
10,24,score_wpa1,5.522121
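
In recent versions of tidyr, `gather` is superseded by `pivot_longer`. A minimal sketch of the same wide-to-long step on hypothetical toy data:

```r
library(tibble)
library(tidyr)

# Hypothetical toy data in wide format
toy_wide = tibble(student = 1:3,
                  score_wpa1 = c(5, 6, 7),
                  score_wpa2 = c(4, 5, 6))

# names_to plays the role of key, values_to the role of value
toy_long = pivot_longer(toy_wide,
                        cols = score_wpa1:score_wpa2,
                        names_to = "wpa",
                        values_to = "score")

toy_long
```

Unlike `gather`, `pivot_longer` keeps the rows of each student together, so the result is already ordered by `student`.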


In [32]:
# reorder based on student
head(arrange(fake_data_long, student), 20)

student,age,wpa,score
<int>,<dbl>,<chr>,<dbl>
1,24,score_wpa1,5.350302
1,24,score_wpa2,4.82346
1,24,score_wpa3,1.348207
1,24,score_wpa4,4.074634
2,32,score_wpa1,9.289347
2,32,score_wpa2,8.239027
2,32,score_wpa3,4.434117
2,32,score_wpa4,6.508575
3,28,score_wpa1,6.379613
3,28,score_wpa2,5.770534


In [33]:
# transform back from long to wide format
fake_data_wide = spread(fake_data_long,
                        key='wpa',
                        value='score')
head(fake_data_wide, 5)

student,age,score_wpa1,score_wpa2,score_wpa3,score_wpa4
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,24,5.350302,4.82346,1.348207,4.074634
2,32,9.289347,8.239027,4.434117,6.508575
3,28,6.379613,5.770534,4.299794,4.183542
4,30,6.076898,5.352874,2.835648,4.067586
5,32,7.225936,6.433638,4.978435,5.196511
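
Similarly, `spread` is superseded by `pivot_wider` in recent versions of tidyr. A minimal sketch of the long-to-wide step on hypothetical toy data:

```r
library(tibble)
library(tidyr)

# Hypothetical toy data in long format
toy_long = tibble(student = rep(1:3, each = 2),
                  wpa = rep(c("score_wpa1", "score_wpa2"), times = 3),
                  score = c(5, 4, 6, 5, 7, 6))

# names_from plays the role of key, values_from the role of value
toy_wide = pivot_wider(toy_long,
                       names_from = wpa,
                       values_from = score)

toy_wide
```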


### Join dataframes

In [34]:
# Create fake data:
mean_score_students = runif(N, 5, 10)
fake_data_wide_second_batch = tibble( # same as data.frame but in tidyverse!
    student = (N+1):(2*N),
    age = round(rnorm(N, 30, 5)),
    score_wpa1 = mean_score_students,
    score_wpa2 = mean_score_students*0.9 + rnorm(N, 0, 0.1),
    score_wpa3 = mean_score_students*0.5 + rnorm(N, 0, 1)
)
fake_data_wide_second_batch

student,age,score_wpa1,score_wpa2,score_wpa3
<int>,<dbl>,<dbl>,<dbl>,<dbl>
16,29,5.837102,5.172803,4.653112
17,30,7.682111,6.791576,3.170778
18,29,7.860678,6.947076,3.514781
19,34,8.563173,7.730741,4.724032
20,34,6.030106,5.436123,2.798378
21,38,5.839637,5.336008,3.121091
22,29,9.387974,8.398024,3.431383
23,23,8.433041,7.557266,2.467099
24,31,7.23284,6.579067,2.948424
25,29,9.743257,8.805672,4.002627


In [35]:
fake_data_wide = bind_rows(fake_data_wide, fake_data_wide_second_batch)
fake_data_wide

student,age,score_wpa1,score_wpa2,score_wpa3,score_wpa4
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,24,5.350302,4.82346,1.348207,4.074634
2,32,9.289347,8.239027,4.434117,6.508575
3,28,6.379613,5.770534,4.299794,4.183542
4,30,6.076898,5.352874,2.835648,4.067586
5,32,7.225936,6.433638,4.978435,5.196511
6,23,5.101863,4.591889,1.562291,3.84082
7,19,9.666794,8.728218,4.471286,6.60966
8,31,6.18785,5.694937,2.814672,4.148429
9,32,6.475096,5.909518,1.925187,4.264363
10,24,5.522121,5.05283,2.767325,4.056865
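
`bind_rows` matches columns by name, and a column that is missing in one of the dataframes (like `score_wpa4` in the second batch) is filled with `NA`. A minimal sketch on hypothetical toy data:

```r
library(dplyr)
library(tibble)

# Hypothetical toy batches: the second one lacks score_wpa4
batch_1 = tibble(student = 1:2,
                 score_wpa3 = c(3, 4),
                 score_wpa4 = c(5, 6))
batch_2 = tibble(student = 3:4,
                 score_wpa3 = c(2, 1))

stacked = bind_rows(batch_1, batch_2)

stacked
```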


In [36]:
# new variables
new_info = tibble(
    student = 1:(2*N),
    score_wpa5 = mean_score_students*0.5 + rnorm(2*N, 0, 1),
    gender = rbinom(2*N, 1, .5)
)

In [37]:
new_info

student,score_wpa5,gender
<int>,<dbl>,<int>
1,2.656846,1
2,3.681722,1
3,3.30201,1
4,4.607683,1
5,4.04816,0
6,3.912829,0
7,3.988288,1
8,5.463488,0
9,3.099497,1
10,5.176658,1


In [38]:
fake_data_wide = bind_cols(fake_data_wide,
                           new_info)
fake_data_wide

New names:
* student -> student...1
* student -> student...7



student...1,age,score_wpa1,score_wpa2,score_wpa3,score_wpa4,student...7,score_wpa5,gender
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<int>
1,24,5.350302,4.82346,1.348207,4.074634,1,2.656846,1
2,32,9.289347,8.239027,4.434117,6.508575,2,3.681722,1
3,28,6.379613,5.770534,4.299794,4.183542,3,3.30201,1
4,30,6.076898,5.352874,2.835648,4.067586,4,4.607683,1
5,32,7.225936,6.433638,4.978435,5.196511,5,4.04816,0
6,23,5.101863,4.591889,1.562291,3.84082,6,3.912829,0
7,19,9.666794,8.728218,4.471286,6.60966,7,3.988288,1
8,31,6.18785,5.694937,2.814672,4.148429,8,5.463488,0
9,32,6.475096,5.909518,1.925187,4.264363,9,3.099497,1
10,24,5.522121,5.05283,2.767325,4.056865,10,5.176658,1
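
`bind_cols` glues dataframes together by row position and does not check that the rows refer to the same student. A minimal sketch on hypothetical toy data, where the second dataframe is in a different order, compared with `left_join`:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: info is not sorted by student
scores = tibble(student = 1:3,
                score = c(10, 20, 30))
info = tibble(student = c(3L, 1L, 2L),
              gender = c(0, 1, 1))

# Positional binding: student 1 gets the gender of student 3
glued = suppressMessages(bind_cols(scores, info))

# Key-based join: rows are matched on student
joined = left_join(scores, info, by = "student")

glued
joined
```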


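Note that `bind_cols` matches rows purely by position and keeps both copies of the shared `student` column (renamed to `student...1` and `student...7` above). A better way to combine dataframes is to use the "join" functions, which match rows based on one or more key columns: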

In [39]:
# Create fake data:
N = 15
mean_score_students = runif(N, 5, 10)
first_batch = tibble(
    student = 1:N,
    age = round(rnorm(N, 30, 5)),
    score_wpa1 = mean_score_students,
    score_wpa2 = mean_score_students*0.9 + rnorm(N, 0, 0.1),
    score_wpa3 = mean_score_students*0.5 + rnorm(N, 0, 1),
    score_wpa4 = mean_score_students*0.7 + rnorm(N, 0, 0.2)
)
M = 5
mean_score_students = runif(M, 8, 10)
second_batch = tibble( 
    student = (N+1):(N+M),
    age = round(rnorm(M, 30, 5)),
    score_wpa1 = mean_score_students,
    score_wpa2 = mean_score_students*0.9 + rnorm(M, 0, 0.1),
    score_wpa5 = mean_score_students*0.5 + rnorm(M, 0, 1)
)
new_info = tibble(
    student = 1:(N+M),
    score_wpa6 = runif(N+M, 6, 10)*0.5 + rnorm(N+M, 0, 1),
    gender = rbinom(N+M, 1, .5)
)

In [40]:
first_batch

student,age,score_wpa1,score_wpa2,score_wpa3,score_wpa4
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,26,6.585973,5.803681,3.637984,4.724408
2,22,6.986269,6.296002,3.045269,5.252876
3,32,9.088943,8.164221,5.398606,6.309631
4,37,5.333833,4.906472,1.046895,3.624793
5,33,5.271735,4.761087,1.251065,3.645108
6,31,8.893446,8.11618,5.282504,6.130001
7,27,7.755469,7.091476,3.557465,5.256445
8,37,5.817034,5.260826,3.789399,4.186599
9,33,8.219542,7.322283,4.228375,5.405467
10,37,8.722979,7.81372,4.452039,5.967142


In [41]:
second_batch

student,age,score_wpa1,score_wpa2,score_wpa5
<int>,<dbl>,<dbl>,<dbl>,<dbl>
16,23,9.260536,8.195728,4.573286
17,32,9.82995,8.748276,4.73474
18,37,8.047509,7.323552,1.739104
19,28,8.987178,8.162175,4.004146
20,37,9.507381,8.499145,3.315042


In [42]:
new_info

student,score_wpa6,gender
<int>,<dbl>,<int>
1,2.287797,0
2,4.086398,1
3,5.894512,1
4,3.907379,1
5,1.814034,1
6,2.796843,1
7,5.031935,0
8,4.140034,0
9,3.836112,1
10,5.271993,0


In [43]:
full_join(first_batch, second_batch, by='student')

student,age.x,score_wpa1.x,score_wpa2.x,score_wpa3,score_wpa4,age.y,score_wpa1.y,score_wpa2.y,score_wpa5
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,26.0,6.585973,5.803681,3.637984,4.724408,,,,
2,22.0,6.986269,6.296002,3.045269,5.252876,,,,
3,32.0,9.088943,8.164221,5.398606,6.309631,,,,
4,37.0,5.333833,4.906472,1.046895,3.624793,,,,
5,33.0,5.271735,4.761087,1.251065,3.645108,,,,
6,31.0,8.893446,8.11618,5.282504,6.130001,,,,
7,27.0,7.755469,7.091476,3.557465,5.256445,,,,
8,37.0,5.817034,5.260826,3.789399,4.186599,,,,
9,33.0,8.219542,7.322283,4.228375,5.405467,,,,
10,37.0,8.722979,7.81372,4.452039,5.967142,,,,


In [44]:
full_join(first_batch, second_batch, by=c('student', 'age', "score_wpa1", "score_wpa2"))

student,age,score_wpa1,score_wpa2,score_wpa3,score_wpa4,score_wpa5
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,26,6.585973,5.803681,3.637984,4.724408,
2,22,6.986269,6.296002,3.045269,5.252876,
3,32,9.088943,8.164221,5.398606,6.309631,
4,37,5.333833,4.906472,1.046895,3.624793,
5,33,5.271735,4.761087,1.251065,3.645108,
6,31,8.893446,8.11618,5.282504,6.130001,
7,27,7.755469,7.091476,3.557465,5.256445,
8,37,5.817034,5.260826,3.789399,4.186599,
9,33,8.219542,7.322283,4.228375,5.405467,
10,37,8.722979,7.81372,4.452039,5.967142,


In [45]:
left_join(first_batch, new_info, by=c('student'))

student,age,score_wpa1,score_wpa2,score_wpa3,score_wpa4,score_wpa6,gender
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
1,26,6.585973,5.803681,3.637984,4.724408,2.287797,0
2,22,6.986269,6.296002,3.045269,5.252876,4.086398,1
3,32,9.088943,8.164221,5.398606,6.309631,5.894512,1
4,37,5.333833,4.906472,1.046895,3.624793,3.907379,1
5,33,5.271735,4.761087,1.251065,3.645108,1.814034,1
6,31,8.893446,8.11618,5.282504,6.130001,2.796843,1
7,27,7.755469,7.091476,3.557465,5.256445,5.031935,0
8,37,5.817034,5.260826,3.789399,4.186599,4.140034,0
9,33,8.219542,7.322283,4.228375,5.405467,3.836112,1
10,37,8.722979,7.81372,4.452039,5.967142,5.271993,0


In [46]:
right_join(new_info, second_batch, by=c('student'))

student,score_wpa6,gender,age,score_wpa1,score_wpa2,score_wpa5
<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
16,4.854776,0,23,9.260536,8.195728,4.573286
17,3.035959,1,32,9.82995,8.748276,4.73474
18,5.270939,0,37,8.047509,7.323552,1.739104
19,5.186211,0,28,8.987178,8.162175,4.004146
20,4.528591,0,37,9.507381,8.499145,3.315042
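
The join functions differ in which rows they keep. A minimal sketch on hypothetical toy data comparing the number of rows returned by each; `inner_join` is not used above, but it is part of the same dplyr family:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: students 2 and 3 appear in both dataframes
first = tibble(student = 1:3,
               score_wpa1 = c(6, 7, 8))
info = tibble(student = 2:5,
              gender = c(0, 1, 1, 0))

n_full  = nrow(full_join(first, info, by = "student"))  # all students from both
n_left  = nrow(left_join(first, info, by = "student"))  # all students in first
n_right = nrow(right_join(first, info, by = "student")) # all students in info
n_inner = nrow(inner_join(first, info, by = "student")) # only students in both

print(c(full = n_full, left = n_left, right = n_right, inner = n_inner))
```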


## 3. Now it's your turn

Now you will analyze data from Matthews et al. (2016): Why do we overestimate others' willingness to pay? The purpose of this research was to test whether our beliefs about other people's affluence (i.e., wealth) affect how much we think they will be willing to pay for items. You can find the full paper at http://journal.sjdm.org/15/15909/jdm15909.pdf.

**Variables description:**

Here are descriptions of the data variables (taken from the authors' dataset notes, available at http://journal.sjdm.org/15/15909/Notes.txt):

- `id`: participant id code
- `gender`: participant gender. 1 = male, 2 = female
- `age`: participant age
- `income`: participant annual household income on a categorical scale with 8 options: Less than 15,000; 15,001–25,000; 25,001–35,000; 35,001–50,000; 50,001–75,000; 75,001–100,000; 100,001–150,000; greater than 150,000.
- `p1-p10`: whether the "typical" survey respondent would pay more (coded 1) or less (coded 0) than oneself, for each of the 10 products 
- `task`: whether the participant had to judge the proportion of other people who "have more money than you do" (coded 1) or the proportion who "have less money than you do" (coded 0)
- `havemore`: participant's response when task = 1
- `haveless`: participant's response when task = 0
- `pcmore`: participant's estimate of the proportion of people who have more than they do (calculated as 100-haveless when task=0)
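
To make the relation between `task`, `havemore`, `haveless`, and `pcmore` concrete, here is a minimal sketch on hypothetical toy values (not the real data). It assumes that `pcmore` is simply `havemore` when task = 1, which is consistent with the description above but not stated explicitly in it:

```r
library(dplyr)
library(tibble)

# Hypothetical toy values, not taken from the dataset
toy = tibble(task = c(1, 0, 1),
             havemore = c(60, NA, 25),
             haveless = c(NA, 30, NA))

# Assumption: havemore is used as is when task == 1,
# and 100 - haveless when task == 0
toy = mutate(toy,
             pcmore = if_else(task == 1, havemore, 100 - haveless))

toy
```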

In [48]:
# load some data
matthews_data = read_csv('https://raw.githubusercontent.com/laurafontanesi/r-seminar/master/data/data_wpa4.csv')

demographics = read_csv("https://raw.githubusercontent.com/laurafontanesi/r-seminar/master/data/matthews_demographics.csv")

glimpse(matthews_data)

glimpse(demographics)


[36m──[39m [1m[1mColumn specification[1m[22m [36m─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────[39m
cols(
  id = col_character(),
  gender = col_double(),
  age = col_double(),
  income = col_double(),
  p1 = col_double(),
  p2 = col_double(),
  p3 = col_double(),
  p4 = col_double(),
  p5 = col_double(),
  p6 = col_double(),
  p7 = col_double(),
  p8 = col_double(),
  p9 = col_double(),
  p10 = col_double(),
  task = col_double(),
  havemore = col_double(),
  haveless = col_double(),
  pcmore = col_double()
)


“Missing column names filled in: 'X1' [1]”

── Column specification ──────────────────────────────

Rows: 190
Columns: 18
$ id       <chr> "R_3PtNn51LmSFdLNM", "R_2AXrrg62pgFgtMV", "R_cwEOX3HgnMeVQHL"…
$ gender   <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2…
$ age      <dbl> 26, 32, 25, 33, 24, 22, 47, 26, 29, 32, 29, 28, 31, 24, 25, 2…
$ income   <dbl> 7, 4, 2, 5, 1, 2, 3, 4, 1, 7, 4, 3, 2, 2, 6, 3, 2, 2, 1, 3, 3…
$ p1       <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0…
$ p2       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1…
$ p3       <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
$ p4       <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ p5       <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1…
$ p6       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 

**Task A**

Note: do not use pipes in steps 1 to 4.

1. Currently `gender` is coded as 1 and 2. Create a new dataframe called `new_matthews_data`, in which there is a new column called `gender_labels` that codes gender as "male" and "female". Do it using `mutate`. Then, rename the original `gender` column to `gender_binary` using `rename`. Finally, subtract 1 from all values of `gender_binary` using `mutate` again, so that it is coded as 0 and 1 instead of 1 and 2.

2. In `new_matthews_data`, create a new column called `income_labels` that codes income based on the data description above using `mutate`. Then, using `case_when`, create a new column called `income_recoded` with only 4 income categories (coded as numbers from 1 to 4): below 25,000, 25,000-50,000, 50,000-100,000, and above 100,000. How many observations are there in each of these 4 categories? Use `summarise` to answer.

3. In `new_matthews_data`, transform all numeric columns into integers using `mutate_if`.

4. From `new_matthews_data`, create a summary of the dataset using `summarise`, to answer the following questions: What percent of participants were female? What was the minimum, mean, and maximum `income`? What was the 25th percentile, median, and the 75th percentile of `age`? Use good names for columns.

5. Repeat steps from 1 to 4 (apart from the `summarise` in point 2) using pipes and assign the result to `new_matthews_data_summary`.
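
As a reminder of how `case_when` and `mutate_if` work, here is a minimal sketch on hypothetical toy data (the column names and cut-offs are made up and unrelated to the task):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data
toy = tibble(id = c("a", "b", "c", "d"),
             rating = c(1, 4, 6, 8))

# case_when checks the conditions in order and uses the first that is TRUE
toy = mutate(toy,
             rating_band = case_when(rating <= 2 ~ 1,
                                     rating <= 5 ~ 2,
                                     TRUE ~ 3))

# mutate_if applies a function only to columns for which the test is TRUE
toy = mutate_if(toy, is.numeric, as.integer)

# number of observations per band
band_counts = summarise(group_by(toy, rating_band), count = n())

band_counts
```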

**Task B**

1. From `new_matthews_data`, calculate the mean `p1` to `p10` across participants using `summarise_all` and `select`. Which product scored the highest? Do it again, grouping the data by gender. Is there a difference across gender? What is the mean of the mean `p1` to `p10` across participants? Calculate it on the result of the previous step. You can do these either using pipes or not.

2. Transform the data from wide to long format. In particular, you want 10 rows per subject, with their responses on products 1 to 10 in a column called `wtp`, and the product label in a column called `product`. Call the resulting dataframe `new_matthews_data_long`. Re-order it by `id`. Print the first 20 cases to check this worked. Check that `new_matthews_data_long` has 10 times more rows than `new_matthews_data`.

**Task C**

1. Drop the `X1` column in `demographics` using `select`.

2. Join `new_matthews_data_long` and `demographics` based on the `id`, in order to retain as many rows and columns as possible. Call the resulting dataframe `matthews_data_all`.

3. Calculate the mean `wtp` per subject using `group_by`. You can use pipes or not. Call the resulting dataframe `mean_matthews_data_all`. This should have as many rows as the number of subjects and 2 columns (`id` and mean wtp). Add `height` and `race` as third and fourth columns using one of the join functions.

4. Using `mean_matthews_data_all`, make a barplot showing the mean `wtp` across ethnic groups. Plot confidence intervals. Give appropriate labels to the plot. Do you think there is a difference in willingness to pay across groups?

5. Using `mean_matthews_data_all`, make a scatterplot showing the wtp on the y-axis and the height on the x-axis. Add a regression line. Do you think height predicts willingness to pay?
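
As a reminder of the plotting syntax, here is a minimal sketch on simulated toy data (all names and values are made up). It computes 95% confidence intervals by hand from the t distribution, which is only one of several possible ways to obtain them:

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Simulated toy data, unrelated to the assignment data
set.seed(1)
toy = tibble(group = rep(c("a", "b"), each = 20),
             height_cm = rnorm(40, 170, 10),
             mean_wtp = 0.5 + 0.002 * height_cm + rnorm(40, 0, 0.1))

# Mean and half-width of the 95% confidence interval per group
group_summary = toy %>%
    group_by(group) %>%
    summarise(mean_y = mean(mean_wtp),
              ci = qt(.975, n() - 1) * sd(mean_wtp) / sqrt(n()))

# Barplot with confidence intervals
bar_plot = ggplot(group_summary, aes(x = group, y = mean_y)) +
    geom_col() +
    geom_errorbar(aes(ymin = mean_y - ci, ymax = mean_y + ci), width = .2) +
    labs(x = "Group", y = "Mean score")

# Scatterplot with a linear regression line
scatter_plot = ggplot(toy, aes(x = height_cm, y = mean_wtp)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x) +
    labs(x = "Height (cm)", y = "Mean score")

bar_plot
scatter_plot
```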

## Submit your assignment

Save and email your script to me at [laura.fontanesi@unibas.ch](mailto:laura.fontanesi@unibas.ch) by the end of Friday.