In [1]:
import pandas as pd
import numpy as np

In [2]:
# Show every column and row when displaying DataFrames (no truncation)
pd.options.display.max_columns = None
pd.options.display.max_rows = None

## I. Explore the data  
  
A. Study variable attributes 
 1. Identify each variable's name and the survey item(s) it measures (codebook: https://www.worldvaluessurvey.org/WVSDocumentationWV6.jsp, accessed on 5/25/2021)  
 2. Compute the % missing for each variable
 3. Run quick descriptives (check the range of values, distribution shape, skew/outliers, potential errors, etc.)  
 4. Identify the target variable (and drop duplicates); remember to feature-engineer the target by breaking it out into varying levels of happiness  
  
B. Visualize the data (based on descriptives)
 1. Explore correlations between attributes
 2. Identify transformations that might be needed
 3. Identify extra data that may be useful (gini coefficient, GDP, etc.)
 4. Summarize findings
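
Steps A.2 and A.3 above could be sketched roughly as follows. This is a minimal illustration on a toy DataFrame, not the analysis itself: the column names are borrowed from the survey file, but the values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; the values are illustrative only.
toy = pd.DataFrame({
    "V10": [1.0, 2.0, np.nan, 4.0, 2.0],
    "V11": [3.0, np.nan, np.nan, 1.0, 2.0],
    "V242": [21.0, 24.0, 26.0, 28.0, 35.0],
})

# Percent missing per variable, largest first.
pct_missing = toy.isna().mean().mul(100).sort_values(ascending=False)

# Quick descriptives: range, central tendency, and skew in one table.
descriptives = toy.describe().T
descriptives["skew"] = toy.skew()

print(pct_missing)
print(descriptives[["min", "max", "mean", "50%", "skew"]])
```

The same two expressions should work unchanged on the full DataFrame once it is loaded.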
 
### Codebook notes:  
  
- Weights (See https://www.worldvaluessurvey.org/WVSContents.jsp for further details): 
    - `S018` and `S019` are weighting factors that rescale each country's sample size (N) to 1000 and 1500, respectively
    - these variables are useful for cross-country comparisons 
    - useful for EDA and descriptive analyses; should arguably be dropped before fitting the random forest, **right?**
    - **QUESTION:** are weights useful for PCA and logistic regression?
    - **QUESTION:** I see weights, but no specific population or sample size info - do I need this?
        - population data should not be difficult to derive from the N-preserving weight (`V258`) and this formula:  
        
        $$Weight = S018/1000 * Population$$  
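
For the descriptive use of the weights noted above, a weighted country-level mean might look like the sketch below. The `item` column is a hypothetical survey item and the weight values are invented; only `C_COW_ALPHA` and `S018` are names from the data.

```python
import numpy as np
import pandas as pd

# Toy rows; "item" is a hypothetical survey item and the weights are made up.
toy = pd.DataFrame({
    "C_COW_ALPHA": ["ALG", "ALG", "ALG", "ARG", "ARG"],
    "item": [1.0, 2.0, np.nan, 3.0, 1.0],
    "S018": [0.5, 1.5, 1.0, 1.0, 3.0],
})

def weighted_mean(group, value_col="item", weight_col="S018"):
    """Weighted mean of value_col, skipping rows where the value is missing."""
    valid = group[value_col].notna()
    return np.average(group.loc[valid, value_col], weights=group.loc[valid, weight_col])

# One weighted mean per country, next to the plain mean for comparison.
weighted = pd.Series({
    country: weighted_mean(group)
    for country, group in toy.groupby("C_COW_ALPHA")
})
unweighted = toy.groupby("C_COW_ALPHA")["item"].mean()

print(pd.DataFrame({"unweighted": unweighted, "weighted": weighted}))
```

Rows with a missing item are dropped before weighting so that their weights do not enter the denominator.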
  
### Options for handling missing data:  
Advice from https://heartbeat.fritz.ai/data-handling-scenarios-part-2-working-with-missing-values-in-a-dataset-34b758cfc9fa and https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/  
  
**Mean/Median (numerical) & Mode (categorical) imputation**  
1. pros: 
  - easy to do
  - can be integrated into production or for a future unknown dataset
2. cons: 
  - distorts the distribution of the dataset
  - distorts the variance and covariance of the dataset
  - mode imputation may over-represent the most frequent label when the share of missing values is large
3. when this makes sense: 
  - mean imputation works best for normally distributed variables
  - median imputation is better for skewed distributions 
  - mode imputation for categorical data works best if the missing values are missing at random
  - best used when missing values make up about 5% (or less) of the data
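
A minimal pandas sketch of mean, median, and mode imputation, assuming the fill values are learned on a training split and reused on new data. The column names and values here are illustrative, not taken from the survey.

```python
import numpy as np
import pandas as pd

# Toy train/test split; column names and values are illustrative only.
train = pd.DataFrame({
    "symmetric": [4.0, 5.0, np.nan, 6.0, 5.0],
    "skewed": [1.0, 1.0, 2.0, np.nan, 40.0],
    "category": ["a", "b", "a", None, "a"],
})
test = pd.DataFrame({
    "symmetric": [np.nan, 5.0],
    "skewed": [np.nan, 3.0],
    "category": [None, "b"],
})

# Learn the fill values from the training set only, then reuse them on new data.
fill_values = {
    "symmetric": train["symmetric"].mean(),        # mean for roughly normal items
    "skewed": train["skewed"].median(),            # median for skewed items
    "category": train["category"].mode().iloc[0],  # mode for categorical items
}

train_imputed = train.fillna(fill_values)
test_imputed = test.fillna(fill_values)

print(fill_values)
print(test_imputed)
```

Storing `fill_values` is what makes this approach easy to carry into production or apply to a future dataset.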
  
**Systematic Random Sampling Imputation**  
1. pros: 
  - preserves the variable's variance and distribution 
2. cons: 
  - to impute the test set as well, the observed training-set values that are sampled from must be kept in memory
3. when this makes sense: 
  - can be applied to both numerical and categorical variables
  - used when the values are missing at random
  - when the same imputed values must be reproducible every time the variable is used (by fixing a random state)
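
Random sampling imputation could be sketched as below. The helper `random_sample_impute` is a hypothetical name introduced for illustration; it draws replacements from the observed training values and uses a fixed random state so that the result is reproducible.

```python
import numpy as np
import pandas as pd

def random_sample_impute(train_col, target_col, random_state=0):
    """Fill missing entries in target_col with draws from train_col's observed values."""
    missing = target_col.isna()
    if not missing.any():
        return target_col.copy()
    observed = train_col.dropna()
    # Sample with replacement so any number of gaps can be filled.
    draws = observed.sample(n=int(missing.sum()), replace=True, random_state=random_state)
    filled = target_col.copy()
    filled.loc[missing] = draws.to_numpy()
    return filled

# Toy data; values are illustrative only.
train = pd.Series([1.0, 2.0, np.nan, 4.0, 2.0, np.nan])
test = pd.Series([np.nan, 3.0, np.nan])

train_filled = random_sample_impute(train, train, random_state=42)
test_filled = random_sample_impute(train, test, random_state=42)

print(train_filled.tolist())
print(test_filled.tolist())
```

The test set is filled from the training column, which is why the observed training values have to stay available after fitting.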
  
### Thinking ahead to future steps:
- items may need to be normalized or re-scaled so that the ranges are more similar
- items may need to be reverse-coded to assist with interpretability for linear regression
- retain and rename `C_COW_ALPHA` for country labels
- recode age variable `V242`; create age categories based on groupings identified here: https://www.cia.gov/the-world-factbook/field/age-structure/  
- `V74` and `V74B`: Schwartz benevolence value items; consolidate into one variable based on whichever has fewer missing values  
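
The planned recodes might be sketched as follows, on toy rows. Several details are assumptions made only for illustration: the 1-4 scale for `V10`, the age bands (check them against the linked age-structure page), the helper names, and the choice to backfill the retained benevolence item from the other one.

```python
import numpy as np
import pandas as pd

# Toy rows that reuse the notebook's column names; values are illustrative only.
toy = pd.DataFrame({
    "V10": [1.0, 2.0, 4.0, 3.0],  # assumed to be a 1-4 scale for this sketch
    "V242": [21.0, 24.0, 58.0, 70.0],
    "V74": [2.0, np.nan, np.nan, 1.0],
    "V74B": [np.nan, 3.0, np.nan, np.nan],
})

def reverse_code(col, low, high):
    """Flip a scale so that the lowest code becomes the highest and vice versa."""
    return (low + high) - col

def min_max_scale(col):
    """Rescale a column to the 0-1 range."""
    return (col - col.min()) / (col.max() - col.min())

toy["V10_rev"] = reverse_code(toy["V10"], low=1, high=4)
toy["V10_scaled"] = min_max_scale(toy["V10_rev"])

# Assumed age bands: 0-14, 15-24, 25-54, 55-64, 65+.
age_bins = [0, 14, 24, 54, 64, np.inf]
age_labels = ["0-14", "15-24", "25-54", "55-64", "65+"]
toy["age_group"] = pd.cut(toy["V242"], bins=age_bins, labels=age_labels, include_lowest=True)

# Keep whichever benevolence item has fewer missing values; backfilling from
# the other item is an optional extra step, not something the notes require.
primary, secondary = sorted(["V74", "V74B"], key=lambda c: toy[c].isna().sum())
toy["benevolence"] = toy[primary].fillna(toy[secondary])

print(toy)
```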

In [3]:
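# Load the wave 6 feature-selection extract; low_memory=False reads the file in one pass for consistent dtypes
wvs_w6_data = pd.read_csv('../data/Evaluating_Happiness/w6_feature_selection.csv', low_memory=False)
wvs_w6_data.info()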

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85898 entries, 0 to 85897
Columns: 166 entries, V2 to V262
dtypes: float64(163), int64(2), object(1)
memory usage: 108.8+ MB


In [4]:
wvs_w6_data.head()

Unnamed: 0,V2,C_COW_ALPHA,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V24,V25,V26,V27,V30,V32,V33,V34,V44,V45,V47,V48,V49,V51,V52,V53,V54,V55,V56,V57,V58,V59,V60,V61,V62,V63,V64,V65,V66,V67,V68,V69,V70,V71,V72,V73,V74,V74B,V75,V76,V77,V78,V79,V80,V82,V83,V84,V96,V97,V98,V99,V100,V101,V102,V103,V104,V105,V106,V107,V108,V109,V110,V111,V113,V114,V115,V116,V117,V119,V120,V121,V122,V123,V124,V126,V131,V132,V133,V134,V135,V136,V137,V138,V139,V140,V143,V144G,V147,V150,V151,V152,V153,V154,V155,V170,V171,V173,V174,V176,V177,V179,V180,V181,V182,V183,V184,V187,V188,V189,V190,V191,V192,V193,V194,V195,V196,V197,V198,V199,V200,V202,V203,V204,V205,V207,V208,V209,V210,V211,V213,V214,V216,V225,V229,V230,V237,V238,V239,V240,V242,V248,V258,S018,S019,V262
0,12,ALG,1.0,1.0,1.0,,1.0,1.0,2.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,7.0,4.0,6.0,0.0,10.0,2.0,3.0,1.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,3.0,3.0,2.0,2.0,2.0,4.0,8.0,7.0,6.0,8.0,7.0,5.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,3.0,2.0,2.0,2.0,4.0,3.0,3.0,3.0,2.0,3.0,4.0,4.0,3.0,4.0,3.0,8.0,5.0,6.0,9.0,3.0,4.0,7.0,6.0,7.0,2.0,5.0,1.0,,,10.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,5.0,5.0,5.0,1.0,2.0,2.0,3.0,2.0,2.0,3.0,3.0,3.0,3.0,7.0,8.0,3.0,5.0,6.0,9.0,6.0,6.0,1.0,1.0,1.0,1.0,3.0,1.0,6.0,5.0,1.0,1.0,2.0,2.0,2.0,2.0,6.0,,1.0,4.0,5.0,1.0,21.0,7.0,1.0,0.833333,1.25,2014
1,12,ALG,1.0,2.0,3.0,4.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,3.0,2.0,6.0,8.0,6.0,0.0,10.0,2.0,1.0,2.0,3.0,4.0,3.0,1.0,1.0,1.0,2.0,2.0,3.0,2.0,1.0,1.0,1.0,3.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,3.0,7.0,5.0,5.0,4.0,4.0,6.0,1.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,2.0,3.0,1.0,2.0,2.0,3.0,2.0,2.0,3.0,2.0,2.0,3.0,3.0,2.0,8.0,8.0,8.0,9.0,2.0,6.0,4.0,2.0,4.0,1.0,5.0,1.0,2.0,1.0,10.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,5.0,5.0,1.0,5.0,2.0,3.0,4.0,2.0,2.0,2.0,2.0,2.0,2.0,4.0,8.0,4.0,6.0,4.0,8.0,3.0,4.0,7.0,1.0,1.0,1.0,1.0,1.0,3.0,5.0,1.0,2.0,2.0,2.0,2.0,3.0,6.0,,2.0,3.0,6.0,2.0,24.0,7.0,1.0,0.833333,1.25,2014
2,12,ALG,1.0,3.0,2.0,4.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,2.0,1.0,2.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,6.0,8.0,6.0,0.0,6.0,2.0,4.0,1.0,2.0,1.0,4.0,1.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,4.0,3.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,7.0,7.0,7.0,5.0,7.0,5.0,1.0,3.0,3.0,4.0,4.0,4.0,3.0,2.0,2.0,2.0,4.0,3.0,2.0,2.0,2.0,4.0,3.0,2.0,3.0,2.0,4.0,2.0,2.0,7.0,4.0,8.0,3.0,3.0,6.0,9.0,5.0,6.0,1.0,5.0,1.0,2.0,1.0,6.0,2.0,3.0,1.0,2.0,2.0,3.0,3.0,5.0,5.0,5.0,5.0,2.0,3.0,2.0,3.0,2.0,3.0,3.0,3.0,3.0,4.0,7.0,5.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,4.0,1.0,4.0,5.0,1.0,1.0,3.0,2.0,4.0,2.0,3.0,2.0,1.0,4.0,6.0,2.0,26.0,5.0,1.0,0.833333,1.25,2014
3,12,ALG,1.0,1.0,3.0,4.0,3.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,6.0,8.0,6.0,0.0,6.0,2.0,1.0,3.0,1.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0,3.0,2.0,3.0,3.0,1.0,1.0,2.0,2.0,3.0,1.0,2.0,2.0,2.0,9.0,5.0,6.0,4.0,6.0,8.0,1.0,3.0,3.0,2.0,2.0,3.0,2.0,3.0,4.0,2.0,4.0,2.0,3.0,3.0,4.0,2.0,2.0,3.0,1.0,2.0,4.0,3.0,2.0,7.0,9.0,5.0,5.0,7.0,3.0,8.0,7.0,8.0,2.0,5.0,1.0,2.0,1.0,10.0,2.0,3.0,4.0,1.0,2.0,2.0,2.0,5.0,5.0,1.0,5.0,2.0,3.0,3.0,3.0,2.0,2.0,3.0,3.0,3.0,6.0,6.0,3.0,5.0,5.0,7.0,4.0,6.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,4.0,4.0,5.0,2.0,28.0,6.0,1.0,0.833333,1.25,2014
4,12,ALG,1.0,1.0,1.0,2.0,1.0,1.0,1.0,3.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,1.0,1.0,3.0,2.0,2.0,6.0,6.0,1.0,3.0,4.0,2.0,1.0,2.0,3.0,4.0,2.0,1.0,2.0,1.0,1.0,2.0,5.0,1.0,2.0,3.0,1.0,4.0,3.0,2.0,2.0,3.0,1.0,2.0,2.0,2.0,8.0,4.0,7.0,4.0,6.0,6.0,2.0,2.0,3.0,4.0,2.0,3.0,2.0,3.0,3.0,2.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,2.0,4.0,3.0,2.0,3.0,2.0,8.0,4.0,7.0,3.0,3.0,8.0,6.0,5.0,6.0,2.0,5.0,1.0,1.0,1.0,10.0,2.0,3.0,2.0,2.0,2.0,3.0,3.0,5.0,5.0,5.0,5.0,2.0,3.0,3.0,4.0,2.0,3.0,3.0,3.0,3.0,6.0,2.0,4.0,4.0,6.0,6.0,6.0,5.0,7.0,1.0,1.0,1.0,3.0,1.0,4.0,5.0,1.0,1.0,2.0,2.0,3.0,2.0,3.0,2.0,2.0,3.0,7.0,2.0,35.0,3.0,1.0,0.833333,1.25,2014
