## Introduction

This notebook is based on Josep Espasa's book [How to Weight a Survey](https://bookdown.org/jespasareig/Book_How_to_weight_a_survey/introduction.html). Its purpose is to translate his R code into Python. The original data files were SPSS files; I converted them into .csv files for easier importing.

I tried to summarize Josep's book in shorter paragraphs for easier digestion. Please refer to his book for a more in-depth explanation.
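A minimal sketch of how an SPSS file could be converted to .csv with pandas. This is not necessarily how the files used here were produced: the helper name `spss_to_csv` and the input file name are placeholders, and `pd.read_spss` needs the optional `pyreadstat` package.

```python
import pandas as pd

def spss_to_csv(sav_path, csv_path):
    """Read an SPSS .sav file and write it out as a .csv file."""
    # pd.read_spss relies on the optional pyreadstat package being installed.
    df = pd.read_spss(sav_path)
    # index=False avoids writing the row index as an extra unnamed column.
    df.to_csv(csv_path, index=False)
    return df

# Hypothetical usage; the .sav file name is a placeholder, not the actual source file:
# spss_to_csv('ess7_uk.sav', 'uk_data.csv')
```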

In [91]:
import pandas as pd

In [94]:
data = pd.read_csv('../data/uk_data.csv')
data.head()

Unnamed: 0,idno,typesamp,interva,telnum,agea_1,gendera1,type,access,physa,littera,vandaa,psu,prob,vote,prtvtbgb,prtclbgb,prtdgcl,ctzcntr,ctzshipc,brncntr,cntbrthc,cgtsmke,cgtsday,alcfreq,alcwkdy,alcwknd,hhmmb,gndr,agea,eisced,pdwrk,edctn,uempla,uempli,rtrd,wrkctra,hinctnta,region,alcohol_day
0,100000003,Address,Complete and valid interview related to CF,Present,,,Single unit: Terraced house,"No, neither of these",Satisfactory,Small amount,None or almost none,9388.0,0.000203,Yes,Liberal Democrat,,,Yes,,Yes,,I have never smoked,0.0,Once a week,0.0,8.0,1.0,Female,49.0,"ES-ISCED V2, higher tertiary education, >= MA ...",Marked,Not marked,Not marked,Not marked,Not marked,Unlimited,M - 4th decile,West Midlands (England),0.326531
1,100000005,Address,Complete and valid interview related to CF,Present,,,Single unit: Semi-detached house,"No, neither of these",Good,None or almost none,None or almost none,9241.0,0.000203,Yes,Liberal Democrat,,,Yes,,Yes,,I smoke daily,10.0,Once a week,48.0,96.0,2.0,Female,45.0,"ES-ISCED V1, lower tertiary education, BA level",Marked,Not marked,Not marked,Not marked,Not marked,No contract,R - 2nd decile,Scotland,8.816327
2,100000008,Address,Complete and valid interview related to CF,Present,,,Single unit: Terraced house,"Yes, locked gate/door",Good,None or almost none,None or almost none,9472.0,0.000203,Yes,Labour,Labour,Not close,Yes,,Yes,,I have never smoked,0.0,Several times a week,42.0,0.0,1.0,Male,76.0,"ES-ISCED I , less than lower secondary",Not marked,Not marked,Not marked,Not marked,Marked,Unlimited,R - 2nd decile,East Midlands (England),0.914286
3,100000009,Address,Complete and valid interview related to CF,Present,,,Single unit: Semi-detached house,"No, neither of these",Satisfactory,Large amount,None or almost none,9450.0,5.1e-05,Yes,Labour,,,Yes,,No,,I have never smoked,0.0,Never,,,5.0,Male,50.0,"ES-ISCED II, lower secondary",Not marked,Not marked,Not marked,Not marked,Not marked,Unlimited,S - 6th decile,South East (England),0.0
4,100000010,Address,Complete and valid interview related to CF,Present,,,Single unit: Terraced house,"No, neither of these",Satisfactory,None or almost none,None or almost none,9479.0,0.000102,No,,,,Yes,,Yes,,I smoke daily,20.0,2-3 times a month,55.0,0.0,2.0,Male,67.0,"ES-ISCED I , less than lower secondary",Not marked,Not marked,Not marked,Not marked,Marked,,J - 1st decile,Yorkshire and the Humber,0.032653


#### Meaning of each Column
 1        idno Respondent's identification number                                             
 2    typesamp Type of the sample                                                             
 3     interva Interview information for the sample unit                                      
 4      telnum Telephone number                                                               
 5      agea_1 Estimation of age of respondent or household member who refuses, by interviewer
 6    gendera1 Gender of respondent or household member who refuses, recorded by interviewer  
 7        type Type of house respondent lives in                                              
 8      access Entry phone or locked gate/door before reaching respondent's individual door   
 9       physa Assessment overall physical condition building/house                           
10     littera Amount of litter and rubbish in the immediate vicinity                         
11      vandaa Amount of vandalism and graffiti in the immediate vicinity                     
12         psu PSU                                                                            
13        prob PROB                                                                           
14        vote Voted last national election                                                   
15    prtvtbgb Party voted for in last national election, United Kingdom                      
16    prtclbgb Which party feel closer to, United Kingdom                                     
17     prtdgcl How close to party                                                             
18     ctzcntr Citizen of country                                                             
19    ctzshipc Citizenship                                                                    
20     brncntr Born in country                                                                
21    cntbrthc Country of birth                                                               
22     cgtsmke Cigarettes smoking behaviour                                                   
23     cgtsday How many cigarettes smoke on typical day                                       
24     alcfreq How often drink alcohol                                                        
25     alcwkdy Grams alcohol, last time drinking on a weekday, Monday to Thursday             
26     alcwknd Grams alcohol, last time drinking on a weekend day, Friday to Sunday           
27       hhmmb Number of people living regularly as member of household                       
28        gndr Gender                                                                         
29        agea Age of respondent, calculated                                                  
30      eisced Highest level of education, ES - ISCED                                         
31       pdwrk Doing last 7 days: paid work                                                   
32       edctn Doing last 7 days: education                                                   
33      uempla Doing last 7 days: unemployed, actively looking for job                        
34      uempli Doing last 7 days: unemployed, not actively looking for job                    
35        rtrd Doing last 7 days: retired                                                     
36     wrkctra Employment contract unlimited or limited duration                              
37    hinctnta Household's total net income, all sources                                      
38      region Region                                                                         
39 alcohol_day Estimated daily alcohol consumption (derived variable)

### Four Steps of Weighting

1. Base/design weights
2. Non-response weights
3. Use of auxiliary data/calibration
4. Analysis of weight variability/trimming

## Base/Design Weights

### Definition
Design Weight - The inverse of the probability of a unit being drawn from the survey frame into the sample

### Example 
Let's say there are 100 people in a group: 80 of them are male and 20 of them are female. We decide to draw a sample of 10 people from the group, 5 male and 5 female.

Then, the chance of a male being drawn is
    
$
5 \div 80 = 0.0625
$

The chance of a female being drawn is

$
5 \div 20 = 0.25
$

The design weight for male is:

$
1 \div 0.0625 = 80 \div 5 = 16
$

The design weight for female is:

$
1 \div 0.25 = 20 \div 5 = 4
$

It means every male in the survey represents 16 males in the group and every female in the survey represents 4 females in the group.
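The same arithmetic can be sketched in a few lines of Python. The dictionaries below simply restate the toy numbers from this example; the variable names are my own.

```python
# Toy population from the example: 80 males and 20 females.
population = {"male": 80, "female": 20}
# We draw 5 people from each group.
sampled = {"male": 5, "female": 5}

# Selection probability = sampled units / population units in the group.
selection_prob = {g: sampled[g] / population[g] for g in population}
# Design weight = inverse of the selection probability.
design_weight = {g: 1 / p for g, p in selection_prob.items()}

print(selection_prob)  # {'male': 0.0625, 'female': 0.25}
print(design_weight)   # {'male': 16.0, 'female': 4.0}
```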

For the 7th European Social Survey in the UK (the csv data file used in the following section), the sampling process is described as:

1. Select 20 addresses from a sampled postcode address file.
2. Select 1 dwelling from each address (a dwelling is defined as a self-contained place to live that does not require basic facilities such as cooking, washing or toilet facilities to be shared with the occupants of other dwelling units).
3. Select 1 household from the dwelling, and then 1 person from each household.

For more detail, [please read page 165 of the ESS7 data documentation](https://www.europeansocialsurvey.org/docs/round7/survey/ESS7_data_documentation_report_e03_2.pdf) under the section titled ‘Sampling procedure’.
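In a multi-stage design like this, a person's overall selection probability is the product of the selection probabilities at each stage. The sketch below assumes each stage picks one candidate at random with equal probability. The function name and all numbers are hypothetical and are not taken from the ESS documentation.

```python
def selection_probability(p_address, n_dwellings, n_households, n_persons):
    """Overall probability of selecting one person in a multi-stage design.

    One dwelling, one household and one person are each picked at random,
    so each stage contributes a factor of 1 / (number of candidates).
    """
    return p_address * (1 / n_dwellings) * (1 / n_households) * (1 / n_persons)

# Hypothetical address selection probability, NOT an ESS value.
p_address = 0.0002

# A person living alone versus a person in a four-person household.
single = selection_probability(p_address, n_dwellings=1, n_households=1, n_persons=1)
family = selection_probability(p_address, n_dwellings=1, n_households=1, n_persons=4)

print(single, 1 / single)
print(family, 1 / family)
```

Under these assumptions, a person in a four-person household is a quarter as likely to be selected as a person living alone, so their design weight is four times larger.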


In [59]:
# remove rows with NaN prob
data_clean = data.dropna(subset=['prob']).copy()

There are 15 unique probabilities

In [112]:
data_clean.prob.nunique()

15

The two most common probabilities are 0.0001015308 and 0.0002030616, which correspond to one in 9849 and one in 4925, respectively.

In [113]:
data_clean.prob.value_counts()

0.0001015308    1144
0.0002030616     787
0.0000676872     188
0.0000507654     101
0.0000406123      22
0.0000338436       6
0.0000225624       6
0.0000184601       2
0.0001368103       2
0.0000203062       1
0.0000253827       1
0.0000126914       1
0.0000168276       1
0.0000169218       1
0.0000290088       1
Name: prob, dtype: int64
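Taking the inverse of a probability gives the "one in N" reading. The sketch below hard-codes the four most frequent values from the output above. It also shows that they equal the largest value divided by 1, 2, 3 and 4, which would be consistent with picking one person at random within a household, although that interpretation is my assumption.

```python
# The four most frequent selection probabilities, copied from the value counts.
probs = [0.0002030616, 0.0001015308, 0.0000676872, 0.0000507654]

for p in probs:
    # 1 / p is the number of population units each sampled unit stands for.
    one_in = round(1 / p)
    # Ratio of the largest probability to this one.
    ratio = probs[0] / p
    print(f"p = {p:.10f} -> one in {one_in}; largest / p = {ratio:.2f}")
```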

Also, people living in retirement housing have a higher chance of being selected than people living in student apartments. This makes intuitive sense, as there are generally fewer people living in a retirement housing unit than in a student dorm.

In [53]:
data_clean.groupby('type').agg({'prob':['mean'], 'idno':['count']}).sort_values(by=[('prob', 'mean')], ascending=False)

Unnamed: 0_level_0,prob,idno
Unnamed: 0_level_1,mean,count
type,Unnamed: 1_level_2,Unnamed: 2_level_2
Multi-unit: Sheltered/retirement housing,0.000187,21
"Multi-unit house, flat",0.000162,346
Other,0.00016,15
House-trailer or boat,0.000135,3
Single unit: Terraced house,0.000134,591
Only housing unit in building with other purpose,0.000131,7
Single unit: Semi-detached house,0.000121,670
Single unit: Detached house,0.000117,582
Farm,0.000115,24
"Multi-unit: Student apartments, rooms",4.1e-05,4


Looking at the hhmmb column (i.e. number of people living regularly as member of household), we can see that the more people live in a household, the lower each person's chance of being sampled.

In [121]:
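def convert_hhmmb_to_category(x):
    # Leave missing household sizes as NaN so that dropna() below can remove them
    # (NaN < 6 is False, so they would otherwise be labelled '5+').
    if pd.isna(x):
        return x
    # Household sizes 1-5 keep their own category; larger households share '5+' (more than 5).
    if x < 6:
        return str(x)
    return '5+'

data_clean['hhmmb'] = data_clean.hhmmb.apply(convert_hhmmb_to_category)
data_clean['hhmmb'] = data_clean['hhmmb'].astype('category')
data_clean_h = data_clean.dropna(subset=['hhmmb']).copy()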

data_clean_h.groupby('hhmmb').agg({'idno':['count'], 'prob':['mean']}).sort_values(by=('prob','mean'), ascending=True)

Unnamed: 0_level_0,idno,prob
Unnamed: 0_level_1,count,mean
hhmmb,Unnamed: 1_level_2,Unnamed: 2_level_2
5+,33,7.48662e-05
5.0,118,8.32801e-05
4.0,303,9.02682e-05
3.0,301,9.80303e-05
2.0,764,0.0001084191
1.0,745,0.0001926022


## Calculate Base/Design Weight

To calculate the base weight, use:

$1 \div prob$

In [122]:
data['base_weight'] = 1 / data.prob
data[['prob', 'base_weight']].head(10)

Unnamed: 0,prob,base_weight
0,0.0002030616,4924.6130357143
1,0.0002030616,4924.6130357143
2,0.0002030616,4924.6130357143
3,5.07654e-05,19698.4521428571
4,0.0001015308,9849.2260714286
5,0.0001015308,9849.2260714286
6,0.0001015308,9849.2260714286
7,0.0002030616,4924.6130357143
8,0.0001015308,9849.2260714286
9,0.0001015308,9849.2260714286


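`base_weight` is the number of people in the population that each sampled unit represents.  
The sum of all design weights should equal the size of the whole population.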
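This property is easy to check on a small made-up sample. The sketch below builds a hypothetical DataFrame of 5 males and 5 females drawn from a population of 80 males and 20 females; the data are invented for illustration and are not part of the ESS file.

```python
import pandas as pd

# Hypothetical toy sample: 5 males and 5 females drawn from 80 males and 20 females.
toy = pd.DataFrame({
    "gender": ["male"] * 5 + ["female"] * 5,
    "prob": [5 / 80] * 5 + [5 / 20] * 5,
})
toy["base_weight"] = 1 / toy["prob"]

# Each group's weights add up to that group's population size (80 and 20)...
group_totals = toy.groupby("gender")["base_weight"].sum()
print(group_totals)
# ...and all weights together add up to the whole population (100).
print(toy["base_weight"].sum())
```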

In [74]:
bw_clean = data.dropna(subset=['base_weight']).copy()

> The sum of all design weights should be equal to the total number of units in our population. The ESS dataset for the UK only included sampling probabilities for respondents (i.e. sampled units that responded to the survey!) but they did not include sampling probabilities of non-respondents. We can guess that this is because sampling probability depends on information that is obtained from the interview (i.e. number of people in household, number of households in dwelling, number of dwellings in address, etc.). Not knowing the sampling probability for some sampled units is not an optimal situation.

> The sum of our computed weights in the ESS dataset with 2,264 respondents equals 21,338,524. Doing a very simple extrapolation to include the 3,335 non-respondents would give us a sum of weights equal to 52,757,499. This last figure would be much closer to the total UK population over 15.
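One way such an extrapolation could be done is to assume that non-respondents have, on average, the same design weight as respondents. The book does not spell out its exact calculation, so the sketch below is only an approximation built from the three numbers quoted above.

```python
respondents = 2264
non_respondents = 3335
sum_of_weights = 21_338_524  # sum of design weights over respondents, as quoted above

# Assumption: non-respondents have the same average design weight as respondents.
mean_weight = sum_of_weights / respondents
extrapolated = mean_weight * (respondents + non_respondents)

print(f"{extrapolated:,.0f}")
```

This gives roughly 52.8 million, close to (though not identical with) the figure quoted above.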

In [82]:
data_clean.shape

(2264, 39)

> It is a common practice for many researchers to scale the weights so that their sum equals the sample size (instead of the population size). Here we compute our scaled design weights and compare them with the ones given in the ESS dataset. We see that our scaled weights (`base_weight_scaled`) are almost equal to those provided in the ESS dataset (`dweight`). The small differences are probably due to rounding error.

In [86]:
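# Scale the weights so that they sum to the number of respondents instead of the population size
bw_clean['base_weight_scaled'] = bw_clean.base_weight / bw_clean.base_weight.sum() * data_clean.shape[0]
# `original` is assumed to be the DataFrame loaded from the original ESS file,
# which holds the official `dweight` column; the cell that loads it is not shown here
merged = bw_clean[['idno','base_weight', 'base_weight_scaled']].merge(original[['idno','dweight']], on='idno')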

In [87]:
merged.head(10)

Unnamed: 0,idno,base_weight,base_weight_scaled,dweight
0,100000003,4924.613036,0.522497,0.526796
1,100000005,4924.613036,0.522497,0.526796
2,100000008,4924.613036,0.522497,0.526796
3,100000009,19698.452143,2.08999,2.107184
4,100000010,9849.226071,1.044995,1.053592
5,100000012,9849.226071,1.044995,1.053592
6,100000015,9849.226071,1.044995,1.053592
7,100000016,4924.613036,0.522497,0.526796
8,100000017,9849.226071,1.044995,1.053592
9,100000020,9849.226071,1.044995,1.053592


> Design weights should sum to the size of the population from which the sample is drawn, or to the total number of respondents if scaled as in the ESS.

In [90]:
merged.base_weight_scaled.sum()

2264.0000000000005

In [89]:
merged.dweight.sum()

2264.000000000001