In [43]:
include("../src/SyntheticPopulation.jl")

# Summary:
1. The sample-free IPF algorithm is well suited to creating joint distributions of independent attributes (about 94% of cells fit well in the run below).
2. The sample-free IPF algorithm performs poorly when the attributes are highly dependent (about 20% of cells fit well in the run below).
3. To address this, we provide a configuration file for tuning the IPF algorithm. The config is described in a separate file.

# 1. Population with dependent variables

### 1.1. Generate target population with dependent variables
First, we generate a target population that we treat as the real population and then try to synthesize. Marital status and income are both highly dependent on sex; age is drawn uniformly for both sexes.

In [44]:
SIZE = 600000
SEX = ['M', 'F']; SEX_WEIGHTS = [0.5, 0.5]
MARITIAL_STATUS = ["Not_married", "Married", "Divorced", "Widowed"]; 
MARITIAL_WEIGHTS_M = [0.1, 0.2, 0.3, 0.4]; 
MARITIAL_WEIGHTS_F = [0.4, 0.3, 0.2, 0.1];
AGE = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]; 
AGE_WEIGHTS = repeat([1 / 16], 16)
INCOME = [40000, 50000, 60000, 70000, 80000]; 
INCOME_WEIGHTS_M = [0.01, 0.1, 0.2, 0.3, 0.39];
INCOME_WEIGHTS_F = [0.39, 0.3, 0.2, 0.1, 0.01];


using DataFrames, StatsBase
population_m = DataFrame(
    AGE = sample(AGE, Weights(AGE_WEIGHTS),Int(SIZE / 2)),
    MARITIAL_STATUS = sample(MARITIAL_STATUS, Weights(MARITIAL_WEIGHTS_M), Int(SIZE / 2)),
    SEX = repeat(['M'], Int(SIZE / 2)),
    INCOME = sample(INCOME, Weights(INCOME_WEIGHTS_M), Int(SIZE / 2)),
)
population_f = DataFrame(
    AGE = sample(AGE, Weights(AGE_WEIGHTS), Int(SIZE / 2)),
    MARITIAL_STATUS = sample(MARITIAL_STATUS, Weights(MARITIAL_WEIGHTS_F), Int(SIZE / 2)),
    SEX = repeat(['F'], Int(SIZE / 2)),
    INCOME = sample(INCOME, Weights(INCOME_WEIGHTS_F), Int(SIZE / 2)),
)
disaggregated_dependent_population = reduce(vcat, [population_f, population_m])

dependent_population = combine(groupby(disaggregated_dependent_population, names(disaggregated_dependent_population), sort=true), nrow)
rename!(dependent_population, :nrow => :population)
sort!(dependent_population, [:INCOME, :SEX, :MARITIAL_STATUS, :AGE])

 Row │ AGE    MARITIAL_STATUS  SEX   INCOME  population
     │ Int64  String           Char  Int64   Int64
─────┼──────────────────────────────────────────────────
   1 │     5  Divorced         F      40000        1496
   2 │    10  Divorced         F      40000        1422
   3 │    15  Divorced         F      40000        1442
   4 │    20  Divorced         F      40000        1457
   5 │    25  Divorced         F      40000        1448
   6 │    30  Divorced         F      40000        1503
   7 │    35  Divorced         F      40000        1481
   8 │    40  Divorced         F      40000        1457
   9 │    45  Divorced         F      40000        1503
  10 │    50  Divorced         F      40000        1494


### 1.2. Compute marginal attribute distributions of the dependent population
Next, we compute the marginal attribute distributions. This is the kind of data that is usually available from a census, and it is the input to our population generation algorithms.

In [45]:
dependent_age_sex = combine(groupby(disaggregated_dependent_population, [:AGE, :SEX], sort=true), nrow); sort!(dependent_age_sex, [:SEX, :AGE])
dependent_sex_maritial = combine(groupby(disaggregated_dependent_population, [:MARITIAL_STATUS, :SEX], sort=true), nrow); sort!(dependent_sex_maritial, [:SEX, :MARITIAL_STATUS])
dependent_income = combine(groupby(disaggregated_dependent_population, [:INCOME], sort=true), nrow)
dependent_age_sex, dependent_sex_maritial, dependent_income = map(x -> rename!(x, :nrow => :population), [dependent_age_sex, dependent_sex_maritial, dependent_income])

3-element Vector{DataFrame}:
 32×3 DataFrame
 Row │ AGE    SEX   population
     │ Int64  Char  Int64
─────┼─────────────────────────
   1 │     5  F          18782
   2 │    10  F          18321
   3 │    15  F          18719
   4 │    20  F          18932
   5 │    25  F          18443
   6 │    30  F          18839
   7 │    35  F          18703
   8 │    40  F          18816
  ⋮  │   ⋮     ⋮        ⋮
  26 │    50  M          18649
  27 │    55  M          18908
  28 │    60  M          18876
  29 │    65  M          18988
  30 │    70  M          18826
  31 │    75  M          18426
  32 │    80  M          18535
                17 rows omitted
 8×3 DataFrame
 Row │ MARITIAL_STATUS  SEX   population
     │ String           Char  Int64
─────┼───────────────────────────────────
   1 │ Divorced         F          59900
   2 │ Marr

### 1.3. Generate dependent population from marginals
Next, we use `generate_joint_distributions` to estimate the joint distribution of the attributes from their marginal distributions.
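
To illustrate why dependence between attributes matters for this step, here is a minimal sketch of sample-free IPF on a two-way table. It assumes a uniform seed (all ones), since no sample is available. The helper `ipf_2d` and the toy SEX × INCOME table are hypothetical and are not part of `SyntheticPopulation.jl`; how `generate_joint_distributions` builds its seed internally is not shown in this notebook.

```julia
# Hypothetical helper: two-dimensional IPF starting from a uniform seed.
function ipf_2d(row_targets::AbstractVector{<:Real},
                col_targets::AbstractVector{<:Real}; iterations::Int = 10)
    # Uniform seed: every cell starts equal because no sample is available.
    table = ones(length(row_targets), length(col_targets))
    for _ in 1:iterations
        # Scale each row so that row sums match the row targets.
        table .*= row_targets ./ sum(table, dims = 2)
        # Scale each column so that column sums match the column targets.
        table .*= col_targets' ./ sum(table, dims = 1)
    end
    return table
end

# Toy "true" joint table: rows are SEX (F, M), columns are the five INCOME levels.
# The weights mirror INCOME_WEIGHTS_F and INCOME_WEIGHTS_M with 300000 people per sex.
true_joint = 300_000 .* [0.39 0.30 0.20 0.10 0.01;
                         0.01 0.10 0.20 0.30 0.39]

# Only the margins are passed to IPF, as with census data.
row_targets = vec(sum(true_joint, dims = 2))
col_targets = vec(sum(true_joint, dims = 1))

fitted = ipf_2d(row_targets, col_targets)
println(fitted)
```

With a uniform seed the fitted table is the product of the margins, so every cell comes out at about 60000, although the true F/40000 cell holds about 117000 people. Any dependence that is not contained in the supplied margins is lost.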

In [46]:
#dependent population
generated_dependent_aggregated_population = generate_joint_distributions(dependent_age_sex, dependent_sex_maritial, dependent_income)
generated_dependent_aggregated_population = generated_dependent_aggregated_population[:, Not(:id)]

┌ Info: Converged in 2 iterations.
└ @ ProportionalFitting /Users/marcinzurek/.julia/packages/ProportionalFitting/gNJEu/src/ipf.jl:130
┌ Info: Converged in 2 iterations.
└ @ ProportionalFitting /Users/marcinzurek/.julia/packages/ProportionalFitting/gNJEu/src/ipf.jl:130
┌ Info: Inconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [600001, 600000]
└ @ ProportionalFitting /Users/marcinzurek/.julia/packages/ProportionalFitting/gNJEu/src/ipf.jl:61
┌ Info: Converged in 1 iterations.
└ @ ProportionalFitting /Users/marcinzurek/.julia/packages/ProportionalFitting/gNJEu/src/ipf.jl:130


 Row │ AGE    MARITIAL_STATUS  SEX   INCOME  population
     │ Int64  String           Char  Int64   Int64
─────┼──────────────────────────────────────────────────
   1 │     5  Divorced         F      40000         753
   2 │    10  Divorced         F      40000         734
   3 │    15  Divorced         F      40000         750
   4 │    20  Divorced         F      40000         759
   5 │    25  Divorced         F      40000         739
   6 │    30  Divorced         F      40000         755
   7 │    35  Divorced         F      40000         750
   8 │    40  Divorced         F      40000         754
   9 │    45  Divorced         F      40000         763
  10 │    50  Divorced         F      40000         765


### 1.4. Evaluation of fit of generated dependent population.
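Finally, we evaluate how well the generated population matches the target population. We use the Z-score approach described by Edwards and Tanton (2013) [1]: a cell is considered well fitted when its Z-score lies between -1.96 and 1.96, and the sum of squared Z-scores is compared with a chi-square critical value.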


[1] Edwards, K. L., & Tanton, R. (2013). Validation of spatial microsimulation models. Spatial microsimulation: A reference guide for users, 249-258. https://ndl.ethernet.edu.et/bitstream/123456789/14722/1/205.pdf#page=38
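
As a worked example of the Z-score used in this evaluation, the sketch below computes it for a single cell. The function name `cell_z_score` is hypothetical; the formula mirrors the one applied to the whole table in this notebook, and the inputs are the counts of the first cell (AGE 5, Divorced, F, INCOME 40000) with a total population of 600000.

```julia
# Hypothetical helper: Z-score of one cell, given the observed count,
# the estimated count, and the total population size.
function cell_z_score(observed::Real, estimated::Real, total::Real)
    p = observed / total    # observed proportion of the cell
    t = estimated / total   # estimated proportion of the cell
    return (p - t) / sqrt(p * (1 - p) / total)
end

z = cell_z_score(1496, 753, 600_000)
println("Z-score: ", z)
println("Well fitted: ", -1.96 < z < 1.96)
```

The result is about 19.23, far outside the interval from -1.96 to 1.96, so this cell does not fit well.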

In [47]:
sort!(generated_dependent_aggregated_population)
sort!(dependent_population)
dependent_population.:estimated_population = generated_dependent_aggregated_population.:population

p = dependent_population.:population/sum(dependent_population.:population)
t = dependent_population.:estimated_population/sum(dependent_population.:population)
N = sum(dependent_population.:population)
dependent_population.:Z_score = (p .- t) ./ sqrt.((p .* (1 .- p)) ./ N)
dependent_population

 Row │ AGE    MARITIAL_STATUS  SEX   INCOME  population  estimated_population  Z_score
     │ Int64  String           Char  Int64   Int64       Int64                 Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │     5  Divorced         F      40000        1496                   753    19.2338
   2 │     5  Divorced         F      50000        1103                   748    10.6989
   3 │     5  Divorced         F      60000         744                   750  -0.220107
   4 │     5  Divorced         F      70000         381                   749   -18.8592
   5 │     5  Divorced         F      80000          33                   750   -124.817
   6 │     5  Divorced         M      40000          61                  1139   -138.031
   7 │     5  Divorced         M      50000         587                  1132   -22.5056
   8 │     5  Divorced         M      60000        1119                  1135  -0.478752
   9 │     5  Divorced         M      70000        1769                  1134      15.12
  10 │     5  Divorced         M      80000        2143                  1135    21.8136


In [48]:
#percentage of well-fitting values
wfv = count(i -> (-1.96<i<1.96), dependent_population.Z_score) / nrow(dependent_population)
print("Percentage of well fitting values: ", wfv, "\n")

#does the table have a good fit?
#sum of squared Z-scores; 640 is the number of cells, i.e. the chi2 degrees of freedom, not the critical value.
#The statistic is far above the 5% critical value for 640 degrees of freedom (roughly 700) -> bad fit
cv = sum(dependent_population.Z_score .^ 2)
print("Table does not have good fit. Chi2 degrees of freedom: 640; calculated statistic: ", cv)

Percentage of well fitting values: 0.1953125
Table does not have good fit. Chi2 degrees of freedom: 640; calculated statistic: 2.388516533436027e6
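
The statistic above is compared with a chi-square critical value for 640 degrees of freedom, one per cell (16 ages × 4 marital statuses × 2 sexes × 5 income levels). As a rough sketch, that critical value can be approximated without extra packages using the Wilson-Hilferty formula. The function name `chisq_critical_approx`, the 5% significance level, and the hard-coded standard normal quantile 1.6449 are assumptions made for illustration.

```julia
# Hypothetical helper: Wilson-Hilferty approximation to the upper quantile
# of a chi-square distribution with `df` degrees of freedom.
# `z` is the matching standard normal quantile (1.6449 for the 5% level).
function chisq_critical_approx(df::Real; z::Real = 1.6449)
    a = 2 / (9 * df)
    return df * (1 - a + z * sqrt(a))^3
end

critical = chisq_critical_approx(640)
println("Approximate 5% critical value: ", critical)

# Statistic reported for the dependent population in this notebook.
statistic = 2.388516533436027e6
println("Good fit: ", statistic < critical)
```

For 640 degrees of freedom the approximation gives roughly 700, so a statistic of about 2.39 million indicates a poor fit.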

# 2. Population with independent variables

### 2.1. Generate target population with independent variables
First, we generate a target population that we treat as the real population and then try to synthesize. All variables are independent.

In [49]:
SIZE = 600000
SEX = ['M', 'F']; SEX_WEIGHTS = [0.5, 0.5]
MARITIAL_STATUS = ["Not_married", "Married", "Divorced", "Widowed"]; MARITIAL_WEIGHTS = [0.3, 0.5, 0.1, 0.1]
AGE = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]; AGE_WEIGHTS = repeat([1 / 16], 16)
INCOME = [40000, 50000, 60000, 70000, 80000]; INCOME_WEIGHTS = [0.2, 0.2, 0.2, 0.2, 0.2]

using StatsBase
disaggregated_independent_population = DataFrame(
    AGE = sample(AGE, Weights(AGE_WEIGHTS), SIZE),
    MARITIAL_STATUS = sample(MARITIAL_STATUS, Weights(MARITIAL_WEIGHTS), SIZE),
    SEX = sample(SEX, Weights(SEX_WEIGHTS), SIZE),
    INCOME = sample(INCOME, Weights(INCOME_WEIGHTS), SIZE),
)

independent_population = combine(groupby(disaggregated_independent_population, names(disaggregated_independent_population), sort=true), nrow)
rename!(independent_population, :nrow => :population)
sort!(independent_population, [:INCOME, :SEX, :MARITIAL_STATUS, :AGE])


### 2.2. Compute marginal attribute distributions of the independent population
Next, we compute the marginal attribute distributions. This is the kind of data that is usually available from a census, and it is the input to our population generation algorithms.

In [50]:
independent_age_sex = combine(groupby(disaggregated_independent_population, [:AGE, :SEX], sort=true), nrow); sort!(independent_age_sex, [:SEX, :AGE])
independent_sex_maritial = combine(groupby(disaggregated_independent_population, [:MARITIAL_STATUS, :SEX], sort=true), nrow); sort!(independent_sex_maritial, [:SEX, :MARITIAL_STATUS])
independent_income = combine(groupby(disaggregated_independent_population, [:INCOME], sort=true), nrow);
independent_age_sex, independent_sex_maritial, independent_income = map(x -> rename!(x, :nrow => :population), [independent_age_sex, independent_sex_maritial, independent_income])

3-element Vector{DataFrame}:
 32×3 DataFrame
 Row │ AGE    SEX   population
     │ Int64  Char  Int64
─────┼─────────────────────────
   1 │     5  F          18623
   2 │    10  F          19012
   3 │    15  F          18684
   4 │    20  F          18835
   5 │    25  F          18680
   6 │    30  F          18711
   7 │    35  F          18655
   8 │    40  F          18752
  ⋮  │   ⋮     ⋮        ⋮
  26 │    50  M          19167
  27 │    55  M          18707
  28 │    60  M          18716
  29 │    65  M          18645
  30 │    70  M          18690
  31 │    75  M          18805
  32 │    80  M          18746
                17 rows omitted
 8×3 DataFrame
 Row │ MARITIAL_STATUS  SEX   population
     │ String           Char  Int64
─────┼───────────────────────────────────
   1 │ Divorced         F          29919
   2 │ Marr

### 2.3. Generate independent population from marginals
Next, we use `generate_joint_distributions` to estimate the joint distribution of the attributes from their marginal distributions.

In [51]:
#independent population
generated_aggregated_indep_population = generate_joint_distributions(independent_age_sex, independent_sex_maritial, independent_income)
generated_aggregated_indep_population = generated_aggregated_indep_population[:, Not(:id)]

┌ Info: Converged in 2 iterations.
└ @ ProportionalFitting /Users/marcinzurek/.julia/packages/ProportionalFitting/gNJEu/src/ipf.jl:130
┌ Info: Converged in 2 iterations.
└ @ ProportionalFitting /Users/marcinzurek/.julia/packages/ProportionalFitting/gNJEu/src/ipf.jl:130
┌ Info: Inconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [599999, 600000]
└ @ ProportionalFitting /Users/marcinzurek/.julia/packages/ProportionalFitting/gNJEu/src/ipf.jl:61
┌ Info: Converged in 1 iterations.
└ @ ProportionalFitting /Users/marcinzurek/.julia/packages/ProportionalFitting/gNJEu/src/ipf.jl:130


 Row │ AGE    MARITIAL_STATUS  SEX   INCOME  population
     │ Int64  String           Char  Int64   Int64
─────┼──────────────────────────────────────────────────
   1 │     5  Divorced         F      40000         371
   2 │    10  Divorced         F      40000         379
   3 │    15  Divorced         F      40000         372
   4 │    20  Divorced         F      40000         375
   5 │    25  Divorced         F      40000         372
   6 │    30  Divorced         F      40000         373
   7 │    35  Divorced         F      40000         372
   8 │    40  Divorced         F      40000         374
   9 │    45  Divorced         F      40000         369
  10 │    50  Divorced         F      40000         372


### 2.4. Evaluation of fit of generated independent population.
Finally, we evaluate how well the generated population matches the target population. We use the Z-score approach described by Edwards and Tanton (2013) [1].


[1] Edwards, K. L., & Tanton, R. (2013). Validation of spatial microsimulation models. Spatial microsimulation: A reference guide for users, 249-258. https://ndl.ethernet.edu.et/bitstream/123456789/14722/1/205.pdf#page=38

In [52]:
sort!(generated_aggregated_indep_population)
sort!(independent_population)
independent_population.:estimated_population = generated_aggregated_indep_population.:population

p = independent_population.:population/sum(independent_population.:population)
t = independent_population.:estimated_population/sum(independent_population.:population)
N = sum(independent_population.:population)
independent_population.:Z_score = (p .- t) ./ sqrt.((p .* (1 .- p)) ./ N)

independent_population

 Row │ AGE    MARITIAL_STATUS  SEX   INCOME  population  estimated_population  Z_score
     │ Int64  String           Char  Int64   Int64       Int64                 Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │     5  Divorced         F      40000         351                   371   -1.06783
   2 │     5  Divorced         F      50000         377                   371   0.309113
   3 │     5  Divorced         F      60000         369                   373  -0.208296
   4 │     5  Divorced         F      70000         305                   371   -3.78011
   5 │     5  Divorced         F      80000         386                   372   0.712811
   6 │     5  Divorced         M      40000         379                   368   0.565211
   7 │     5  Divorced         M      50000         352                   369  -0.906369
   8 │     5  Divorced         M      60000         374                   370     0.2069
   9 │     5  Divorced         M      70000         376                   369   0.361111
  10 │     5  Divorced         M      80000         360                   369  -0.474484


In [53]:
#percentage of well-fitting values
wfv = count(i -> (-1.96<i<1.96), independent_population.Z_score) / nrow(independent_population)
print("Percentage of well fitting values: ", wfv, "\n")

#does the table have a good fit?
#sum of squared Z-scores; 640 is the number of cells, i.e. the chi2 degrees of freedom, not the critical value.
#The statistic is slightly below the 5% critical value for 640 degrees of freedom (roughly 700) -> acceptable fit
cv = sum(independent_population.Z_score .^ 2)
print("Table does have good fit. Chi2 degrees of freedom: 640; calculated statistic: ", cv)

Percentage of well fitting values: 0.940625
Table does have good fit. Chi2 degrees of freedom: 640; calculated statistic: 697.2591193860881