In [1]:
cd("../")

In [119]:
using CSV
using DataFrames
using LinearAlgebra
using Statistics
using Optim
using Distributions
using Random

## P-MEDM Problem and Data

- Census microdata (PUMS) with $n$ = 15 respondents, $N$ = 1286 people.
- Public-Use Microdata Area (PUMA) consisting of 10 block groups.
- **PROBLEM**: PUMS provides no information on the block groups in which each "person" (sample weight unit) in the PUMS resides. 

**Solution: Iterative Proportional Fitting (IPF).** Use census data published at two spatial scales, the target units (block groups) and an upper level (tracts), known as _constraints_, to make a "guess" at where the people described by each response type belong. Each "guess" at the probabilities of people's locations given by P-MEDM is used to reconstruct synthetic constraints, which are compared to the published constraints. The guess is updated iteratively until a "best guess" is reached.
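
To make the iterative updating concrete, here is a minimal sketch of classic two-way IPF (raking) on a toy table. It is only an illustration of the general idea, not the P-MEDM solver used below; the function name `ipf` and all numbers are hypothetical.

```julia
# Classic two-way IPF (raking) on a toy table; illustrative only, not P-MEDM itself.
function ipf(seed::Matrix{Float64}, row_targets::Vector{Float64},
             col_targets::Vector{Float64}; iters::Int = 50)
    T = copy(seed)
    for _ in 1:iters
        T .*= row_targets ./ sum(T, dims = 2)    # rescale rows to match row totals
        T .*= col_targets' ./ sum(T, dims = 1)   # rescale columns to match column totals
    end
    return T
end

seed = [1.0 2.0; 3.0 4.0]     # hypothetical initial guess
rows = [40.0, 60.0]           # hypothetical published row totals
cols = [30.0, 70.0]           # hypothetical published column totals
fitted = ipf(seed, rows, cols)
```

After enough passes, both sets of margins of `fitted` agree with the targets, which is the sense in which the "guess" is refined until it reproduces the constraints.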

But...**another problem!** Our census data comes from the American Community Survey (ACS). ACS data is updated annually but is inherently uncertain because it is based on a sample rather than a complete count of the population (as in the Decennial Census).

**Penalized Maximum-Entropy Dasymetric Modeling (P-MEDM)** is an IPF technique specialized to deal with uncertain census data like the ACS (Leyk et al. xx; Nagle et al. 2014). Besides the errors between the published and synthetic constraints, P-MEDM also takes the _error variances_ of the published constraints into account. The closer the synthetic constraints are to the published constraints relative to their variances, the better the solution. Conversely, the further a synthetic constraint falls outside the range of probable values (e.g., the 90% margin of error) of a published constraint, the more the solution is _penalized_.
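
As a rough illustration of how the penalty behaves, the sketch below uses the same quadratic form that appears later in `penalized_entropy` (squared error divided by twice the variance) and the 1.645 multiplier used for the 90% margin of error. The helper names `moe90` and `penalty` and the numbers are hypothetical.

```julia
# Hypothetical helpers; names are illustrative and not part of the notebook.

# 90% margin of error around a published estimate y with standard error se
moe90(y, se) = (y - 1.645 * se, y + 1.645 * se)

# Quadratic penalty on the gap between published (y) and synthetic (yhat) values,
# scaled by the error variance se^2
penalty(y, yhat, se) = (y - yhat)^2 / (2 * se^2)

y, se = 100.0, 10.0
lower, upper = moe90(y, se)        # roughly (83.55, 116.45)
inside  = penalty(y, 105.0, se)    # synthetic value within the MOE: 0.125
outside = penalty(y, 130.0, se)    # synthetic value beyond the MOE: 4.5
```

A synthetic value three standard errors away is penalized 36 times more than one half a standard error away, so solutions are pulled toward the published range without being forced to match it exactly.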

In [3]:
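## read in data (current CSV.jl releases need the sink type as the second argument)
constraints_ind = CSV.read("data/toy_constraints_ind.csv", DataFrame);
constraints_bg = CSV.read("data/toy_constraints_bg.csv", DataFrame);
constraints_trt = CSV.read("data/toy_constraints_trt.csv", DataFrame);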

In [4]:
## build geo lookup
bg_id = string.(collect(constraints_bg[!,1]));
trt_id = [s[1:2] for s in bg_id];
geo_lookup = DataFrame(bg = bg_id, trt = trt_id)

Row,bg,trt
,String,String
1,101,10
2,102,10
3,103,10
4,201,20
5,202,20
6,301,30
7,302,30
8,303,30
9,304,30
10,305,30


## Setup

In [5]:
## PUMS response ids
serial = collect(constraints_ind.SERIAL);

In [6]:
## PUMS sample weights
wt = collect(constraints_ind.PERWT);

In [7]:
## population and sample size
N = sum(constraints_bg.POP);
n = nrow(constraints_ind);

In [8]:
## Individual-level (PUMS) constraints
excl = ["SERIAL", "PERWT"];
constraint_cols = [i ∉ excl for i in names(constraints_ind)];
pX = Matrix(constraints_ind[!,constraint_cols]);

In [9]:
## geographic constraints
est_cols_bg = [!endswith(i, 's') && i != "GEOID" for i in names(constraints_bg)]
est_cols_trt = [!endswith(i, 's') && i != "GEOID" for i in names(constraints_trt)]
Y1 = Matrix(constraints_trt[!,est_cols_trt])
Y2 = Matrix(constraints_bg[!,est_cols_bg]);

In [10]:
## error variances
se_cols = [endswith(i, 's') for i in names(constraints_bg)];
se_cols = names(constraints_bg)[se_cols];
V1 = Matrix(constraints_trt[!,se_cols]) .^ 2;
V2 = Matrix(constraints_bg[!,se_cols]) .^ 2;

In [11]:
## Geographic crosswalk
A1 = [];

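for G in unique(geo_lookup.trt)

    # indicator row: 1 where the block group belongs to tract G
    row = zeros(Int8, 1, nrow(constraints_bg))

    # match on the tract prefix only, so "10" cannot match e.g. "210"
    isG = [startswith(g, G) for g in collect(geo_lookup.bg)]
    for i in findall(isG)
        row[i] = 1
    end
    append!(A1, row)

end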

A1 = reshape(A1, nrow(constraints_bg), nrow(constraints_trt));
A1 = transpose(A1)

3×10 Transpose{Any,Array{Any,2}}:
 1  1  1  0  0  0  0  0  0  0
 0  0  0  1  1  0  0  0  0  0
 0  0  0  0  0  1  1  1  1  1
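
The same tract-by-block-group indicator matrix can also be sketched with a single comprehension, which yields a concretely typed integer matrix instead of an `Any` array. This is an alternative formulation under the assumption that the tract id is the first two characters of the block group id, as in `geo_lookup`; the variable names here are illustrative.

```julia
# Block group ids copied from the geo lookup table above
bg  = ["101", "102", "103", "201", "202", "301", "302", "303", "304", "305"]

# Assumption: the tract id is the first two characters of the block group id
trt = unique(first.(bg, 2))

# Rows index tracts, columns index block groups
A = [startswith(b, t) ? 1 : 0 for t in trt, b in bg]
```

Each column contains exactly one `1`, since every block group belongs to a single tract.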

In [12]:
## Target unit identity matrix
A2 = Matrix(I, nrow(constraints_bg), nrow(constraints_bg))

10×10 Array{Bool,2}:
 1  0  0  0  0  0  0  0  0  0
 0  1  0  0  0  0  0  0  0  0
 0  0  1  0  0  0  0  0  0  0
 0  0  0  1  0  0  0  0  0  0
 0  0  0  0  1  0  0  0  0  0
 0  0  0  0  0  1  0  0  0  0
 0  0  0  0  0  0  1  0  0  0
 0  0  0  0  0  0  0  1  0  0
 0  0  0  0  0  0  0  0  1  0
 0  0  0  0  0  0  0  0  0  1

In [13]:
## Solution space (X-matrix)
X1 = kron(transpose(pX), A1);
X2 = kron(transpose(pX), A2);
X = transpose(vcat(X1, X2));

In [14]:
## Design weights
q = repeat(wt, size(A1)[2]);
q = reshape(q, n, size(A1)[2]);
q = q / sum(q);
q = vec(q');
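
The ordering produced by the four steps above is easy to lose track of, so here is a small sketch on made-up numbers (two respondents, three block groups; all names and values are hypothetical). It suggests that the final vector is respondent-major: all block groups for the first respondent, then all block groups for the second.

```julia
wt_toy = [10, 30]   # hypothetical sample weights for 2 respondents
n_geo  = 3          # hypothetical number of block groups

q_toy = repeat(wt_toy, n_geo)                    # [10, 30, 10, 30, 10, 30]
q_toy = reshape(q_toy, length(wt_toy), n_geo)    # 2×3: rows are respondents
q_toy = q_toy / sum(q_toy)                       # normalize so the weights sum to 1
q_toy = vec(q_toy')                              # flatten respondent by respondent
```

Each respondent's weight is spread evenly over the block groups as a prior, here `1/12` per cell for the first respondent and `1/4` per cell for the second.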

In [15]:
## Vectorize geo. constraints (Y) and normalize
Y_vec = vcat(vec(Y1), vec(Y2)) / N; 

In [16]:
## Vectorize variances and normalize
V_vec = vcat(vec(V1), vec(V2)) * (n / N^2);

In [17]:
## Diagonal matrix of variances
sV = Diagonal(V_vec);

## Functions

In [18]:
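## Compute the P-MEDM probabilities from q, X, λ
compute_allocation = function(q, X, λ)

    # unnormalized exponential tilt for each (respondent, block group) pair
    a0 = exp.(-X * λ)

    # weight by the prior (design) probabilities
    a = a0 .* q

    # normalizing constant so the allocation sums to 1
    b = q' * a0

    a / b

end;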
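
A quick sanity check of this formula on toy inputs: with `λ = 0` the allocation should reduce to the prior `q`, and for any `λ` it should sum to 1. The one-line `allocate` below restates the same formula so the snippet runs on its own; its name and the toy values are hypothetical.

```julia
# Same formula as compute_allocation, restated so this snippet is self-contained
allocate(q, X, λ) = (q .* exp.(-X * λ)) ./ (q' * exp.(-X * λ))

q_toy = [0.2, 0.3, 0.5]                  # hypothetical prior weights (sum to 1)
X_toy = [1.0 0.0; 0.0 1.0; 1.0 1.0]      # hypothetical 3×2 constraint matrix

p0 = allocate(q_toy, X_toy, zeros(2))    # λ = 0 returns the prior
p1 = allocate(q_toy, X_toy, [0.5, -0.5]) # nonzero λ shifts mass between cells
```

In `p1`, the second cell gains probability because its coefficient is negative, which is how the optimizer moves people between locations to match the constraints.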

In [19]:
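## Primal function: penalized entropy for a single constraint
## (as called below: w = published value, d = synthetic value, v = error variance)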
penalized_entropy = function(w, d, n, N, v)

    e = d - w

    penalty = (e^2 / (2. * v))

    ent = ((n / N) * (w / d) * log((w/d)))

    pe = (-1. * ent) - penalty

    return pe

end;
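
Written out, the function above evaluates the following expression for one constraint, with $w$, $d$, and $v$ as in the code and $n$, $N$ the sample and population sizes. This is a transcription of the code rather than a derivation.

$$
pe(w, d) = -\frac{n}{N}\,\frac{w}{d}\,\log\frac{w}{d} \;-\; \frac{(d - w)^2}{2v}
$$

The first term is the entropy contribution and the second is the penalty, which grows with the squared gap between the two values and shrinks as the variance $v$ increases.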

In [20]:
## Objective Function
neg_pe = function(λ)

    phat = compute_allocation(q, X, λ)
    phat = reshape(phat, size(A2)[1], size(pX)[1])'

    Yhat2 = (N * phat)' * pX

    phat_trt = (phat * N) * A1'
    Yhat1 = phat_trt' * pX

    Yhat = vcat(vec(Yhat1), vec(Yhat2))

    Ype = DataFrame(Y = Y_vec * N, Yhat = Yhat, V = V_vec * (N^2/n))

    pe = penalized_entropy.(Ype.Y, Ype.Yhat, n, N, Ype.V)

    -1. * mean(pe)

end;

## Optimization

In [32]:
opt = optimize(neg_pe, zeros(length(Y_vec)), BFGS(),
            Optim.Options(iterations = 200); autodiff = :forward)

 * Status: success

 * Candidate solution
    Minimizer: [5.36e-01, -3.61e-02, -5.00e-01,  ...]
    Minimum:   -1.553683e-06

 * Found with
    Algorithm:     BFGS
    Initial Point: [0.00e+00, 0.00e+00, 0.00e+00,  ...]

 * Convergence measures
    |x - x'|               = 1.48e-08 ≰ 0.0e+00
    |x - x'|/|x'|          = 5.33e-09 ≰ 0.0e+00
    |f(x) - f(x')|         = 1.83e-16 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 1.18e-10 ≰ 0.0e+00
    |g(x)|                 = 5.04e-09 ≤ 1.0e-08

 * Work counters
    Seconds run:   1  (vs limit Inf)
    Iterations:    99
    f(x) calls:    281
    ∇f(x) calls:   281


In [33]:
## final coefficients (lambdas/λ)
λ = Optim.minimizer(opt);

In [38]:
## inspect results
phat = compute_allocation(q, X, λ);
phat = reshape(phat, size(A2)[1], size(pX)[1])';

Yhat2 = (N * phat)' * pX;

phat_trt = (phat * N) * A1';
Yhat1 = phat_trt' * pX;

Yhat = vcat(vec(Yhat1), vec(Yhat2));

Ype = DataFrame(Y = Y_vec * N, Yhat = Yhat, V = V_vec * (N^2/n));

#90% MOEs
Ype.MOE_lower = Ype.Y - (sqrt.(Ype.V) * 1.645);
Ype.MOE_upper = Ype.Y + (sqrt.(Ype.V) * 1.645);

first(Ype[:,["Y", "Yhat", "MOE_lower", "MOE_upper"]], 10)

Row,Y,Yhat,MOE_lower,MOE_upper
,Float64,Any,Float64,Float64
1,344.0,344.0,328.481,359.519
2,260.0,260.0,253.021,266.979
3,682.0,681.999,668.237,695.763
4,152.0,152.01,139.152,164.848
5,84.0,84.0142,71.4721,96.5279
6,200.0,200.033,179.454,220.546
7,101.0,101.04,80.7191,121.281
8,106.0,106.041,92.435,119.565
9,242.0,242.035,218.39,265.61
10,175.0,175.017,156.608,193.392


In [24]:
# Proportion of constraints falling outside 90% MOE
sum((Ype.Yhat .< Ype.MOE_lower) + (Ype.Yhat .> Ype.MOE_upper) .>= 1) / nrow(Ype)

0.0

## Synthetic Population Estimates

The target-level (block group) estimates for each response type are simply the allocation probabilities scaled by the total population $N$:

In [69]:
syp = DataFrame(phat * N, :auto);  # DataFrames 1.x needs :auto when building from a matrix
rename!(syp, geo_lookup.bg)

Row,101,102,103,201,202,301,302,303
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,7.37573,1.93002,0.0658886,3.59323,1.1941,4.04046,1.00221,1.18257
2,3.01273,11.0193,5.87868,8.11077,5.46935,8.44493,11.5618,4.8976
3,4.44526,3.51247,4.36986,2.91984,11.771,5.35233,1.9038,1.45184
4,1.92281,1.24944,14.785,0.918948,25.1528,6.47892,12.7274,19.25
5,0.738304,9.91572,20.2678,3.68321,4.43771,1.62544,0.577827,0.0797601
6,9.1724,1.9738,0.640916,3.25129,7.33586,14.0614,19.2625,45.0792
7,47.3276,12.3843,0.422785,23.0566,7.66215,25.9263,6.43083,7.58815
8,2.11594,9.41098,0.52785,7.82914,0.777584,2.11944,0.525404,0.112216
9,0.935844,0.739467,0.91997,0.614703,2.47811,1.12681,0.400801,0.30565
10,1.2343,5.48974,0.307913,4.567,0.453591,1.23634,0.306486,0.0654595


So, for example, there are roughly 25 people matching record `p4` in block group `202`. 

### Uncertainty

The inverse Hessian of the objective at the solution serves as the variance-covariance matrix of the coefficients $\lambda$; see https://www.rpubs.com/nnnagle/PMEDM_2.

In [74]:
# recompute the P-MEDM allocation as a vector
p = compute_allocation(q, X, λ);

In [99]:
## compute the Hessian and invert it
a = -1 * (X'p * p'X)   # -X' p p' X

dp = Diagonal(p)

b = X'dp * X           # X' diag(p) X

H = a + b + sV         # add the normalized error variances

covλ = inv(H);
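
In matrix form, the cell above amounts to the following, where $V$ is the diagonal matrix of normalized error variances (`sV`). This is read directly off the code, not an independent derivation.

$$
H = X^\top \left(\operatorname{diag}(p) - p\,p^\top\right) X + V,
\qquad
\operatorname{cov}(\lambda) \approx H^{-1}
$$

The term in parentheses is the covariance matrix of a single draw from the allocation $p$, so $H$ combines the curvature from the allocation with the curvature from the penalty.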

In [131]:
## simulate λ
Random.seed!(808)
nsim = 100
simλ = []

# sampling distribution for λ; this cell only constructs it, no draws are taken
MvNormal(λ, Matrix(Hermitian(covλ/N)))

# dist = MvNormal(λ, Matrix(Hermitian(covλ/N)))
# for i in 1:nsim
#     push!(simλ, rand(dist))
# end

FullNormal(
dim: 52
μ: [0.5359724874740687, -0.036121414350144, -0.49985107312387594, -0.16319796978176934, -0.13224184852877566, 0.5774394679078754, -0.039870410690129136, 0.22364130422228673, 0.23493977011638045, -0.6041246682755714  …  1.8528190606815707, -0.9802205616190004, -1.4767231673381807, -0.6621858510982104, 0.5455714679494602, 0.38439475015279057, 0.38497908936910974, 2.094204674098523, -0.6503317448614003, -1.9524340661474457]
Σ: [0.7262146616899474 0.2401760186545681 … 0.0014381082337003007 0.0016706295686136402; 0.2401760186545681 2.6318563022249055 … 0.0014558686503763551 0.0016912615882060257; … ; 0.0014381082337003007 0.0014558686503763551 … 0.46250297240079097 0.42136925006684206; 0.0016706295686136402 0.0016912615882060257 … 0.42136925006684206 0.4827829197170452]
)
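
The cell above stops at constructing the distribution. As a sketch of what taking the draws could look like, the snippet below samples from a multivariate normal using only the standard library, via a Cholesky factor of the covariance. The 2-dimensional `μ` and `Σ` are hypothetical stand-ins for `λ` and `covλ/N`. With Distributions loaded, calling `rand` on the `MvNormal` object with the number of draws should give the same kind of matrix, one column per draw.

```julia
using LinearAlgebra, Random

# Hypothetical 2-D stand-ins for λ and covλ / N
μ = [0.5, -0.2]
Σ = [0.7 0.2; 0.2 0.5]

rng  = MersenneTwister(808)
nsim = 100

# If z ~ N(0, I), then μ + L z ~ N(μ, Σ) where Σ = L L'
L = cholesky(Symmetric(Σ)).L
draws = μ .+ L * randn(rng, length(μ), nsim)   # one column per simulated λ
```

Each column of `draws` could then be passed to `compute_allocation` to obtain one simulated allocation, giving a spread of synthetic population estimates rather than a single point estimate.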
