# P-MEDM "Real World" Example

In [1]:
cd("../")

In [2]:
using CSV
using DataFrames
using SparseArrays
using Kronecker
using LinearAlgebra
using SuiteSparse
using Statistics

## P-MEDM Problem and Data

- We have some census microdata (PUMS) for Boulder, Colorado from the 2016 American Community Survey 5-Year Estimates. The microdata consists of $n$ = 5995 respondents representing $N$ = 121,708 people.
- Boulder is represented by a Public-Use Microdata Area (PUMA) consisting of 84 block groups and 27 census tracts.
- **PROBLEM**: PUMS provides no information on the block groups and census tracts in which each "person" (sample weight unit) in the PUMS resides. 

**Solution: Iterative Proportional Fitting (IPF).** Use census data published at two spatial scales, the target units (block groups, 84 units) and an upper level (tracts, 27 units), known as _constraints_, to make a "guess" at where the people described by each response type belong. Each "guess" at the probabilities of people's locations is used to reconstruct synthetic constraints, which are then compared to the published constraints. The guess is updated iteratively until a "best guess" is reached.
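For intuition only, here is a minimal sketch of classic two-way IPF (raking) on a toy table. It is not the P-MEDM procedure used in this notebook; the function name `ipf`, the seed table, and the margins are all made up for illustration.

```julia
# Classic two-way IPF: rescale a seed table until its margins match the targets
function ipf(seed, row_targets, col_targets; iters = 50)
    T = float.(seed)
    for _ in 1:iters
        T .*= row_targets ./ sum(T, dims = 2)     # match the row margins
        T .*= col_targets' ./ sum(T, dims = 1)    # match the column margins
    end
    return T
end

seed = [1 1; 1 1]                                 # uninformative starting guess
fitted = ipf(seed, [30.0, 70.0], [40.0, 60.0])
```

Each pass compares the current margins to the published ones and rescales, which is the "compare and update" loop described above.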

But...**another problem!** Our census data comes from the American Community Survey (ACS). ACS data is updated annually but is inherently uncertain because it is based on a sample vs. a complete count of the population (i.e., the Decennial Census). 

**Penalized Maximum-Entropy Dasymetric Modeling (P-MEDM)** is an IPF technique specialized to deal with uncertain census data like the ACS (Leyk et al xx; Nagle et al 2014). Besides taking errors between the published and synthetic constraints into account, P-MEDM also accounts for their _error variances_. The closer the synthetic constraints are to the published constraints, relative to those variances, the better the solution. Conversely, the further a synthetic constraint falls outside the range of probable values of a published constraint (e.g., its 90% margin of error), the more the solution is _penalized_.

In [3]:
## read in data
constraints_ind = CSV.read("data/boulder_constraints_ind_2016_person.csv")
constraints_bg = CSV.read("data/boulder_constraints_bg_2016_person.csv")
constraints_trt = CSV.read("data/boulder_constraints_trt_2016_person.csv");
constraints_ind = constraints_ind[:,2:ncol(constraints_ind)]
constraints_bg = constraints_bg[:,2:ncol(constraints_bg)]
constraints_trt = constraints_trt[:,2:ncol(constraints_trt)];

The **individual-level** constraints are derived from PUMS data. They consist of: 

- A unique ID for each response (`pid`).
- The sample weight (`wt`), which estimates how many people in the PUMA are described by each PUMS response.
- Binary-encoded constraints.

In [4]:
# preview individual-level constraints
first(constraints_ind, 5)

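| Row | pid | wt | POP | HH | AGE5U | AGE18U | AGE65O | NFAM | NFAMALONE | POV | RACWHITE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | String | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 69a | 13 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 69b | 10 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 1209 | 15 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 1417a | 20 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 5 | 1417b | 21 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |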


The **geographic constraints** at the target/upper levels are derived from the ACS Summary File. They consist of: 

- The geographic identifier for each census unit (`GEOID`).
- The constraint estimates (e.g., `POP`).
- The standard errors on the constraints, denoted here with a trailing `s` (e.g., `POPs`).
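As a rough sketch of how a standard error translates into the 90% margin of error used later in this notebook, the snippet below applies the 1.645 multiplier to the `POP` and `POPs` values of the first block group in the preview. The helper `within_moe` and the test values passed to it are hypothetical.

```julia
# 90% margin of error from a published ACS standard error
z90 = 1.645                       # multiplier used for the 90% MOEs in this notebook

est, se = 960.0, 176.9            # POP and POPs for the first block group in the preview
moe_lower = est - z90 * se
moe_upper = est + z90 * se

# hypothetical helper: does a synthetic estimate fall inside the published MOE?
within_moe(yhat) = moe_lower <= yhat <= moe_upper

println((moe_lower, moe_upper))
println(within_moe(1000.0))
```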

In [5]:
# preview target (block-group) level constraints
first(constraints_bg, 5)

| Row | GEOID | POP | POPs | HH | HHs | AGE5U | AGE5Us | AGE18U | AGE18Us | AGE65O |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 |
| 1 | 80130121011 | 960 | 176.9 | 960 | 176.9 | 47 | 33.5009 | 156 | 59.3071 | 12 |
| 2 | 80130121012 | 1470 | 139.818 | 1470 | 139.818 | 98 | 44.3727 | 325 | 79.2818 | 288 |
| 3 | 80130121013 | 1137 | 135.562 | 1127 | 134.954 | 76 | 29.7376 | 257 | 53.1732 | 140 |
| 4 | 80130121014 | 1078 | 169.605 | 1078 | 169.605 | 14 | 14.9524 | 284 | 60.6168 | 54 |
| 5 | 80130121021 | 1500 | 207.903 | 1500 | 207.903 | 107 | 39.5464 | 253 | 67.8568 | 92 |


In [6]:
# preview upper (tract) level constraints
first(constraints_trt, 5)

| Row | GEOID | POP | POPs | HH | HHs | AGE5U | AGE5Us | AGE18U | AGE18Us | AGE65O |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 |
| 1 | 8013012101 | 4645 | 207.903 | 4635 | 208.511 | 235 | 65.5408 | 1022 | 124.29 | 494 |
| 2 | 8013012102 | 7458 | 313.678 | 7279 | 309.422 | 420 | 86.2215 | 1377 | 160.509 | 843 |
| 3 | 8013012103 | 3953 | 165.35 | 3931 | 167.173 | 111 | 32.2418 | 776 | 91.9884 | 539 |
| 4 | 8013012104 | 2424 | 123.404 | 2414 | 123.404 | 87 | 46.0604 | 586 | 86.1679 | 498 |
| 5 | 8013012105 | 5604 | 299.696 | 5594 | 300.304 | 362 | 78.758 | 1352 | 196.718 | 637 |


The full set of constraints for this example is:

In [7]:
# names of the constraint (estimate) columns: drop GEOID and the standard-error columns
is_constraint = [!endswith(i, 's') && i != "GEOID" for i in names(constraints_bg)]
names(constraints_bg)[is_constraint]

Because the target/upper levels are nested, we need a crosswalk (`geo_lookup`) to link them:

In [8]:
## build geo lookup
geo_lookup = CSV.read("data/boulder_geo_lookup.csv")[:,["bg", "trt"]];
geo_lookup.bg = string.(geo_lookup.bg)
geo_lookup.trt = string.(geo_lookup.trt)
first(geo_lookup, 5)

| Row | bg | trt |
|---|---|---|
| | String | String |
| 1 | 80130121011 | 8013012101 |
| 2 | 80130121012 | 8013012101 |
| 3 | 80130121013 | 8013012101 |
| 4 | 80130121014 | 8013012101 |
| 5 | 80130121021 | 8013012102 |


In [9]:
# Ensure tract IDs between `constraints_trt` and `geo_lookup` are consistent
tix = indexin(unique(geo_lookup.trt), string.(constraints_trt.GEOID));
constraints_trt = constraints_trt[tix,:];
println(sum(unique(geo_lookup.bg) .== string.(constraints_bg.GEOID)) == nrow(constraints_bg))
sum(unique(geo_lookup.trt) .== string.(constraints_trt.GEOID)) == nrow(constraints_trt)

true


true

## Setup

In [10]:
## PUMS response ids
serial = collect(constraints_ind.pid);

In [11]:
## PUMS sample weights
wt = collect(constraints_ind.wt);

In [12]:
## population and sample size
N = sum(constraints_bg.POP);
n = nrow(constraints_ind);

### Individual (PUMS) Constraints

Isolate the individual constraints and store them in a matrix `pX`: 

In [13]:
## Individual-level (PUMS) constraints
excl = ["pid", "wt"]
constraint_cols = [i ∉ excl for i in names(constraints_ind)];
pX = constraints_ind[!,constraint_cols];
pX = convert(Matrix, pX);

### Geographic Constraints

First, ID the constraint columns representing estimates and the constraint columns representing standard errors:

In [14]:
est_cols_bg = [!endswith(i, 's') && i != "GEOID" for i in names(constraints_bg)]
est_cols_trt = [!endswith(i, 's') && i != "GEOID" for i in names(constraints_trt)]
se_cols = [endswith(i, 's') for i in names(constraints_bg)]
se_cols = names(constraints_bg)[se_cols];

Next, isolate the geographic constraint estimates `Y`, first at the upper level, then at the target level:

In [15]:
Y1 = convert(Matrix, constraints_trt[!,est_cols_trt])
Y2 = convert(Matrix, constraints_bg[!,est_cols_bg]);

Then another set of matrices to store the error variances (`V`) based on `se_cols`:

In [16]:
## error variances
V1 = map(x -> x^2, convert(Matrix, constraints_trt[!,se_cols]))
V2 = map(x -> x^2, convert(Matrix, constraints_bg[!,se_cols]));

### Linking Geographies

Now we need to symbolically link the upper and target geographic constraint levels. This is done based on `geo_lookup`.

First, generate a **crosswalk** between the upper and target levels (a matrix with one row per upper unit and one column per target unit, here 27 tracts by 84 block groups):

In [17]:
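## Geographic crosswalk: one row per tract, one column per block group
A1 = Int64[]

for G in unique(geo_lookup.trt)

    # indicator row: 1 where the block group GEOID begins with the tract GEOID `G`
    row = zeros(Int64, 1, nrow(constraints_bg))

    isG = [startswith(g, G) for g in geo_lookup.bg]
    for i in findall(isG)
        row[i] = 1
    end
    append!(A1, row)

end

# the rows were appended end to end, so reshape column-wise and then transpose
A1 = reshape(A1, nrow(constraints_bg), nrow(constraints_trt))
A1 = transpose(A1);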
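To make the shape of the crosswalk concrete, here is a small sketch on a made-up lookup of three block groups nested in two tracts. The IDs are invented, and the comprehension is just one compact way to build the same kind of indicator matrix.

```julia
# Toy lookup: three block groups nested in two tracts (IDs are made up)
bg  = ["10011", "10012", "10021"]
trt = ["1001",  "1001",  "1002"]

tracts = unique(trt)

# A[t, b] = 1 when block group b belongs to tract t
A = [Int(trt[b] == tracts[t]) for t in eachindex(tracts), b in eachindex(bg)]

# every block group should belong to exactly one tract
@assert all(sum(A, dims = 1) .== 1)
```

Multiplying `A` by a vector of block-group counts aggregates those counts to tracts, which is the role `A1` plays in the model matrix.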

Then generate an **identity matrix** for the target units:

In [18]:
## Target unit identity matrix
A2 = Matrix(I, nrow(constraints_bg), nrow(constraints_bg));

### Linking the PUMS constraints to Geographies

#### Model Matrix

Next, create the model matrix `X` for the P-MEDM problem. It consists of:

1. The Kronecker product of the PUMS constraints `pX` and the geographic crosswalk `A1`.
2. The Kronecker product of the PUMS constraints `pX` and the target level identity matrix `A2`.

In [19]:
# a bit slow
X1 = (sparse(pX') ⊗ A1)'
X2 = (sparse(pX') ⊗ A2)'
X = hcat(sparse(X1), sparse(X2));
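The following sketch checks, on toy inputs, what the Kronecker construction buys us: multiplying the transposed model matrix by a vectorized allocation reproduces the aggregated constraint totals. It uses `kron` from the standard library in place of `⊗`, and all values are hypothetical.

```julia
using LinearAlgebra

pX = [1 0; 1 1; 1 0]          # 3 toy responses x 2 constraints
A1 = [1 1 0; 0 0 1]           # 2 toy tracts x 3 toy block groups

# same construction as the cell above, with kron standing in for ⊗
X1 = kron(pX', A1)'           # (3 responses * 3 block groups) x (2 constraints * 2 tracts)

# a uniform toy allocation: P[i, j] = share of response i placed in block group j
P = fill(1 / 9, 3, 3)
p = vec(P')                   # block group varies fastest, matching the rows of X1

# tract-level constraint totals implied by P, computed directly
totals = A1 * P' * pX

# the model matrix gives the same totals in vectorized form
synthetic = X1' * p
```

Here `synthetic` and `vec(totals)` agree, which is why the solver can work with a single sparse matrix `X` and flat vectors.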

#### Design Weights

The design weights `q` are the prior estimate of the P-MEDM allocation matrix. The assumption is that each PUMS record has an equal probability of being found in each target unit, relative to the sample weight's share of the total population.

In [20]:
## Design weights
q = repeat(wt, size(A1)[2]);
q = reshape(q, n, size(A1)[2]);
q = q / sum(q);
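A toy version of the same steps, with three made-up sample weights and two target units, shows what the prior looks like:

```julia
wt = [10, 20, 30]      # toy sample weights for 3 responses
m  = 2                 # toy number of target units

q = reshape(repeat(wt, m), length(wt), m)   # each column repeats wt
q = q / sum(q)                              # normalize so the whole matrix sums to 1

# each response is spread evenly over the target units,
# in proportion to its share of the total weight
println(q)
```

Every column is identical, so before fitting no target unit is favored over another.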

### Vectorize Inputs

Since P-MEDM is a linear problem, we need to vectorize the constraints (`Y`), error variances (`V`), and design weights (`q`) before running it. 

In [21]:
## Vectorize geo. constraints (Y) and normalize
Y_vec = vcat(vec(Y1), vec(Y2)) / N; 

In [22]:
## Vectorize variances and normalize
V_vec = vcat(vec(V1), vec(V2)) * (n / N^2);

In [23]:
## Vectorize the design weights
q = vec(q');

A diagonal matrix of the error variances (`sV`) is also generated to compute the Hessian for the solver:

In [24]:
## Diagonal matrix of variances
sV = Diagonal(V_vec);

## Functions

In [25]:
## Convenience function to compute the P-MEDM probabilities from q, X, λ
compute_allocation = function(q, X, λ)

    qXl = q .* exp.(-X * λ)
    p = qXl / sum(qXl)

end;

In [26]:
## Objective Function
f = function(λ)

    qXl = exp.(-X * λ) .* q
    lvl = λ' * (sV * λ);

    return (Y_vec' * λ) + log(sum(qXl)) + (0.5 * lvl)

end;

In [27]:
## Gradient function
g! = function(G, λ)
    
    qXl = q .* exp.(-X * λ)
    p = qXl / sum(qXl)
    Xp = X'p
    G[:] = Y_vec + (sV * λ) - Xp
    
end;

In [28]:
## Hessian function
h! = function(H, λ)
    
    qXl = q .* exp.(-X * λ)
    p = qXl / sum(qXl)
    dp = spdiagm(0 => p)
    H[:] = -((X'p) * (p'X)) + (X' * dp * X) + sV
    
end;
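Written out, the functions above appear to correspond to the following expressions. This is a transcription of the code rather than a derivation from the cited papers: $Y$ stands for `Y_vec`, $V$ for `sV`, $\odot$ for elementwise multiplication, and $p(\lambda)$ for the vector returned by `compute_allocation`.

$$
\begin{aligned}
p(\lambda) &= \frac{q \odot \exp(-X\lambda)}{\mathbf{1}^\top \left(q \odot \exp(-X\lambda)\right)} \\
f(\lambda) &= Y^\top \lambda + \log\left(\mathbf{1}^\top \left(q \odot \exp(-X\lambda)\right)\right) + \tfrac{1}{2}\,\lambda^\top V \lambda \\
\nabla f(\lambda) &= Y + V\lambda - X^\top p(\lambda) \\
\nabla^2 f(\lambda) &= X^\top \operatorname{diag}\left(p(\lambda)\right) X - \left(X^\top p(\lambda)\right)\left(X^\top p(\lambda)\right)^\top + V
\end{aligned}
$$

The gradient vanishes when the synthetic constraints $X^\top p(\lambda)$ equal $Y + V\lambda$, so constraints with larger variances are allowed to deviate further from their published values.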

## Optimization: Solving the P-MEDM Problem

We initialize the coefficient estimates $\lambda$ at 0, so the starting allocation probabilities are equal to the prior `q`. Each successive step in the P-MEDM optimization updates $\lambda$, moving the allocation away from `q` and toward one that better matches the published constraints.
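A quick sketch on toy inputs confirms this starting point. The function `allocation` below mirrors the form of `compute_allocation`; the values of `q`, `X`, and the nonzero `λ` are hypothetical.

```julia
# toy inputs (hypothetical values, not the Boulder data)
q = [0.1, 0.2, 0.3, 0.4]                      # prior allocation, sums to 1
X = [1.0 0.0; 1.0 1.0; 0.0 1.0; 1.0 1.0]      # 4 allocation cells x 2 constraints

# same form as compute_allocation
function allocation(q, X, λ)
    w = q .* exp.(-X * λ)
    return w / sum(w)
end

p0 = allocation(q, X, zeros(2))       # exp(0) = 1, so the prior is returned unchanged
p1 = allocation(q, X, [0.5, -0.2])    # a nonzero λ reweights the prior
```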

In [29]:
using Optim

init_λ = zeros(length(Y_vec))

opt = optimize(f, g!, h!, init_λ, NewtonTrustRegion(),
               Optim.Options(show_trace=true, iterations = 200));

Iter     Function value   Gradient norm 
     0     0.000000e+00     3.355729e-02
 * time: 0.0736839771270752
     1    -1.055967e-01     1.761934e-02
 * time: 3.564579963684082
     2    -2.184494e-01     1.313110e-02
 * time: 6.883098840713501
     3    -3.018777e-01     8.503686e-03
 * time: 11.576177835464478
     4    -3.448676e-01     2.491152e-03
 * time: 14.889793872833252
     5    -3.565122e-01     8.592010e-04
 * time: 18.04591393470764
     6    -3.580423e-01     2.600936e-04
 * time: 22.101778984069824
     7    -3.581893e-01     6.033541e-05
 * time: 25.195271968841553
     8    -3.582005e-01     7.145121e-06
 * time: 28.4721999168396
     9    -3.582008e-01     1.856517e-07
 * time: 31.253708839416504
    10    -3.582008e-01     1.785954e-10
 * time: 32.71721386909485


In [30]:
## final coefficients (lambdas/λ)
λ = Optim.minimizer(opt);

In [31]:
## inspect results
phat = compute_allocation(q, X, λ);
phat = reshape(phat, size(A2)[1], size(pX)[1])';

Yhat2 = (N * phat)' * pX;

phat_trt = (phat * N) * A1';
Yhat1 = phat_trt' * pX;

Yhat = vcat(vec(Yhat1), vec(Yhat2));

Ype = DataFrame(Y = Y_vec * N, Yhat = Yhat, V = V_vec * (N^2/n));

#90% MOEs
Ype.MOE_lower = Ype.Y - (sqrt.(Ype.V) * 1.645);
Ype.MOE_upper = Ype.Y + (sqrt.(Ype.V) * 1.645);

first(Ype[:,["Y", "Yhat", "MOE_lower", "MOE_upper"]], 10)

| Row | Y | Yhat | MOE_lower | MOE_upper |
|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 |
| 1 | 4645.0 | 4947.0 | 4303.0 | 4987.0 |
| 2 | 7458.0 | 7610.55 | 6942.0 | 7974.0 |
| 3 | 3953.0 | 4082.61 | 3681.0 | 4225.0 |
| 4 | 2424.0 | 2560.43 | 2221.0 | 2627.0 |
| 5 | 5604.0 | 5729.38 | 5111.0 | 6097.0 |
| 6 | 3760.0 | 4025.78 | 3376.0 | 4144.0 |
| 7 | 6189.0 | 6356.08 | 5531.0 | 6847.0 |
| 8 | 7131.0 | 7113.02 | 6597.0 | 7665.0 |
| 9 | 5755.0 | 5865.87 | 5408.0 | 6102.0 |
| 10 | 3949.0 | 4102.31 | 3673.0 | 4225.0 |


Roughly 4.6% of the 1665 constraints from the P-MEDM results do not fall within the ACS 90% MOEs:

In [32]:
# Proportion of constraints falling outside 90% MOE
sum((Ype.Yhat .< Ype.MOE_lower) + (Ype.Yhat .> Ype.MOE_upper) .>= 1) / nrow(Ype)

0.04624624624624624

## Synthetic Population Estimates

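The target-level (block group) estimates for each PUMS response are simply the allocation probabilities scaled by the total population (`phat * N`), with one row per response and one column per block group: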

In [33]:
syp = DataFrame(hcat(serial, (phat * N)))
rn = collect(geo_lookup.bg)
rn = append!(["pid"], rn)
rename!(syp, rn);

### Uncertainty

#### Compute the 95% Confidence Intervals on the Synthetic Population Estimates

In [34]:
# recompute the P-MEDM allocation as a vector
p = compute_allocation(q, X, λ);

The inverse Hessian at the solution is the variance-covariance matrix of the coefficients $\lambda$ (see https://www.rpubs.com/nnnagle/PMEDM_2).

In [35]:
## compute Inverse Hessian
H = h!(Array{Float64}(undef, length(λ), length(λ)), λ);
covλ = inv(H);

In [36]:
## simulate λ
using Distributions
using Random
Random.seed!(808)
nsim = 100
simλ = []

mvn = MvNormal(λ, Matrix(Hermitian(covλ/N)))

simλ = rand(mvn, nsim);

In [37]:
## simulate p
psim = []

for s in 1:nsim
    ps = compute_allocation(q, X, simλ[:,s])
    ps = ps * N
    append!(psim, ps)
end

psim = reshape(psim, :, nsim);

In [38]:
# 95% confidence interval
ci = [quantile(psim[i,:], (0.025, 0.975)) for i in 1:size(psim)[1]]
ci = DataFrame(ci)
rename!(ci, ["lower", "upper"]);
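The simulation steps above follow a general pattern: draw coefficient vectors from their estimated distribution, map each draw through the model, and take percentiles of the results. The sketch below runs that pattern on toy inputs. It samples with a Cholesky factor from the standard library instead of `MvNormal`, and every value and name in it is hypothetical.

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(1)
μ = [0.5, -0.2]                          # toy coefficient estimates
Σ = [0.04 0.01; 0.01 0.09]               # toy covariance (symmetric positive definite)

# draw from N(μ, Σ) using a Cholesky factor, standing in for MvNormal
L = cholesky(Symmetric(Σ)).L
nsim = 1000
draws = μ .+ L * randn(length(μ), nsim)  # one simulated coefficient vector per column

# toy model: same form as compute_allocation
q = [0.1, 0.2, 0.3, 0.4]
X = [1.0 0.0; 1.0 1.0; 0.0 1.0; 1.0 1.0]
function allocation(λ)
    w = q .* exp.(-X * λ)
    return w / sum(w)
end

# one simulated allocation per column
psim = reduce(hcat, [allocation(draws[:, s]) for s in 1:nsim])

# percentile-based 95% interval for each allocation cell
ci = [quantile(psim[i, :], [0.025, 0.975]) for i in 1:size(psim, 1)]
```

Storing the simulations in a numeric matrix, as here, keeps the element type concrete, which tends to be faster than appending to an untyped `[]`.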

In [39]:
# melt the synthetic pop ests (wide to long format)
# 2nd arg = melt columns, 3rd arg = id columns
sypm = stack(syp, names(syp)[2:size(syp)[2]], :pid);
rename!(sypm, ["pid", "geoid", "est"]);

In [40]:
# ensure order matches conf ints
sort!(sypm, [:pid]);

In [41]:
## append the conf ints
res_ci = hcat(sypm, ci);

In [42]:
first(res_ci, 5)

| Row | pid | geoid | est | lower | upper |
|---|---|---|---|---|---|
| | Any | Cat… | Any | Float64 | Float64 |
| 1 | 1000306a | 80130121011 | 0.121192 | 0.131494 | 0.151479 |
| 2 | 1000306a | 80130121012 | 0.183809 | 0.124105 | 0.148392 |
| 3 | 1000306a | 80130121013 | 0.118596 | 0.120327 | 0.146623 |
| 4 | 1000306a | 80130121014 | 0.141434 | 0.0936037 | 0.115228 |
| 5 | 1000306a | 80130121021 | 0.153009 | 0.176153 | 0.202539 |


### Reliability

In this example, we compute the Monte Carlo Coefficient of Variation for all PUMS responses by block group. (_Need to check these results..._)

- [Ref1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3337209/)
- [Ref2](https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781118849972.app2)

In [43]:
# Monte Carlo error
mce = [std(psim[i,:]) for i in 1:size(psim)[1]];

In [44]:
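# Monte Carlo Coefficient of Variation: standard deviation relative to the estimate
# (psim is on the population scale, so the point estimate is p * N)
mcv = mce ./ (p .* N);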
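For reference, the coefficient of variation is conventionally the standard deviation divided by the mean, so smaller values indicate more reliable estimates. A minimal sketch on simulated toy values (all numbers hypothetical):

```julia
using Random, Statistics

Random.seed!(1)
sims = 100 .+ 5 .* randn(10_000)     # toy simulated estimates for one cell

mc_sd = std(sims)                    # Monte Carlo standard deviation
mc_cv = mc_sd / mean(sims)           # spread relative to the mean
```

With a true mean of 100 and standard deviation of 5, `mc_cv` should land close to 0.05.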