# Variability: original data prep

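This notebook tidies up the original data: it keeps only valid responses, applies a few additional cleaning rules, retains the variables needed for this analysis, and saves the result.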

In [14]:
set more off, permanently
set scheme s1mono


(set more preference recorded)



Valid responses are defined as those that:

* don't have missing data (the release app didn't allow missing responses)
* have at least one and no more than 12 activities
* are made within one hour of a beep
* are completed within a further five minutes
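
For illustration only, a flag like `valid_response` could be built roughly as follows. The dataset already contains `valid_response`, and this notebook does not show how it was derived. The variables `n_activities`, `beep_tc`, and `end_tc` are hypothetical names (assumed to hold the number of selected activities and millisecond-resolution beep and completion timestamps). Measuring the "further five minutes" from the start of the response is also an assumption. The missing-data condition is left out because the release app didn't allow missing responses.

```stata
* Sketch only: n_activities, beep_tc and end_tc are hypothetical variable names,
* not confirmed columns of mappiness.dta
generate byte valid_sketch =                              ///
      inrange(n_activities, 1, 12)                        /// between 1 and 12 activities
    & inrange(response_tc - beep_tc, 0, msofhours(1))     /// started within one hour of the beep
    & inrange(end_tc - response_tc, 0, msofminutes(5))    //  completed within a further five minutes

* Compare the sketch against the flag shipped with the data
tabulate valid_sketch valid_response, missing
```

Because `inrange()` returns 0 when its first argument is missing, any response with a missing count or timestamp would be flagged as invalid in this sketch.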

The additional data cleaning rules applied below are motivated as follows:

* under-18s weren't supposed to sign up, and ages over 99 seem more likely to be input errors than true ages
* the project didn't launch until mid-August 2010, so anything prior is test data
* 2017 and 2018 together comprise only 0.8% of the data, so responses after the end of 2016 are dropped

In [15]:
use "mappiness.dta", clear
keep if valid_response

drop if age_at_signup < 18 | age_at_signup > 99 | response_td < date("2010-08-14", "YMD") | response_td > date("2016-12-31", "YMD")
format %tdDay_DD_Mon_CCYY response_td



(5,233,285 observations deleted)

(77,706 observations deleted)
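
As a sanity check, one could assert that every remaining response falls inside the retained age and date windows. This is a minimal sketch that uses only variables and cut-offs from the cell above. Stata treats missing values as larger than any number, so the `> 99` and `> date(...)` conditions also drop missing ages and dates, and the assertions below should hold.

```stata
* All remaining respondents should be aged 18 to 99 at sign-up
assert inrange(age_at_signup, 18, 99)

* All remaining responses should lie between launch and the end of 2016
assert inrange(response_td, td(14aug2010), td(31dec2016))

* Inspect the first and last retained response dates
summarize response_td, format
```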



Now we keep only the variables required for this analysis:

In [16]:
#delimit ;
keep 
  response_id  // unique row identifier
  user_id  // unique respondent identifier
  response_td  // start-of-response timestamp (day resolution)
  response_tc  // start-of-response timestamp (ms resolution)
  response_seq  // number of prior responses by this respondent
  do_*  // dummies for respondent's current activities (e.g. do_work = Working, studying)

  // the following are hedonic variables asked on a continuous slider, scaled 0 - 100
  feel_hpy  // happy
  feel_rlx  // relaxed
  feel_awk  // awake

  // the following are all constant per respondent, because the items were asked only once
  male  // sex dummy: 1 = male, 0 = female 
  age_at_signup  // respondent's sign-up year minus birth year
  hhinc  // gross household income (midpoint of income band)
  lnhhinc  // log of gross household income (midpoint of income band)
  mrg  // marital status
  work  // employment status
  kids  // number of under-16s in the household
  adults  // number of 16+ adults in the household
  health  // subjective health, 1 (poor) - 5 (excellent)
  ls  // life satisfaction rating, scale of 1 - 10
  
  // UK Government Office region of the location identified (via 'at home' response(s)) as respondent's home
  home_nspd_go_region
;
#delimit cr
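
The comments above describe the respondent-level items as constant per respondent. A hedged way to confirm this is to check that each item takes a single value within `user_id`. The sketch below assumes all ten listed variables are present after the `keep`, and it changes the sort order of the data as a side effect.

```stata
* Respondent-level items should not vary within user_id:
* after sorting each respondent's rows by the item, first and last values must match
foreach v of varlist male age_at_signup hhinc lnhhinc mrg work kids adults health ls {
    bysort user_id (`v'): assert `v'[1] == `v'[_N]
}
```

Missing values sort last, so a respondent with a mix of missing and non-missing values for an item would also trip the assertion.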

In [17]:
replace work = 8 if work == 4  // work == 4 means 'Other', and it reads better if this comes at the end
label define work 8 "other", add


(59,071 real changes made)
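
A quick check that the recode behaved as intended might look like the sketch below. It assumes, as the `label define` call above implies, that the value label attached to `work` is also named `work`. The final line is optional and not part of the original workflow: it removes the now-unused mapping for code 4 from the value label.

```stata
* No observation should still carry the old code for 'Other'
assert work != 4

* Inspect the value label and the recoded distribution
label list work
tabulate work, missing

* Optional tidy-up: delete the unused mapping for 4 from the value label
label define work 4 "", modify
```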



In [18]:
save "mappiness_variability_raw.dta", replace

file mappiness_variability_raw.dta saved
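
To confirm the saved file is usable downstream, it can be reloaded and inspected. This is a minimal sketch: `isid` relies on `response_id` being a unique row identifier, as described in the variable list above.

```stata
* Reload the saved file and check its basic structure
use "mappiness_variability_raw.dta", clear
describe, short

* response_id should uniquely identify rows
isid response_id
```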
