In [19]:
include("../src/SyntheticPopulation.jl")

# Summary:
1. Sample-free IPF algorithm is well suited for creating joint distributions of the attributes that are independent (96% of cells fit well).
2. Sample-free IPF algorithm is not so efficient for creating joint distributions of the attributes that are highly dependent (10% of cells fit well).
3. To overcome the challenge, we provide a configurable config file which helps better adjust IPF algorithm. The config is described in another file.

# 1. Population with dependent variables

### 1.1. Generate target population with dependent variables
First we generate target population that we will treat as the real population which we'll try to synthesize using available algorithms. The variables maritial status and income are highly dependent on sex.

In [40]:
SIZE = 600000
SEX = ['M', 'F']; SEX_WEIGHTS = [0.5, 0.5]
MARITIAL_STATUS = ["Not_married", "Married", "Divorced", "Widowed"]; 
MARITIAL_WEIGHTS_M = [0.1, 0.2, 0.3, 0.4]; 
MARITIAL_WEIGHTS_F = [0.4, 0.3, 0.2, 0.1];
AGE = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]; 
AGE_WEIGHTS = repeat([1 / 16], 16)
INCOME = [40000, 50000, 60000, 70000, 80000]; 
INCOME_WEIGHTS_M = [0.01, 0.1, 0.2, 0.3, 0.39];
INCOME_WEIGHTS_F = [0.39, 0.3, 0.2, 0.1, 0.01];


using StatsBase
population_m = DataFrame(
    AGE = sample(AGE, Weights(AGE_WEIGHTS),Int(SIZE / 2)),
    MARITIAL_STATUS = sample(MARITIAL_STATUS, Weights(MARITIAL_WEIGHTS_M), Int(SIZE / 2)),
    SEX = repeat(['M'], Int(SIZE / 2)),
    INCOME = sample(INCOME, Weights(INCOME_WEIGHTS_M), Int(SIZE / 2)),
)
population_f = DataFrame(
    AGE = sample(AGE, Weights(AGE_WEIGHTS), Int(SIZE / 2)),
    MARITIAL_STATUS = sample(MARITIAL_STATUS, Weights(MARITIAL_WEIGHTS_F), Int(SIZE / 2)),
    SEX = repeat(['F'], Int(SIZE / 2)),
    INCOME = sample(INCOME, Weights(INCOME_WEIGHTS_F), Int(SIZE / 2)),
)
disaggregated_dependent_population = reduce(vcat, [population_f, population_m])

dependent_population = combine(groupby(disaggregated_dependent_population, names(disaggregated_dependent_population), sort=true), nrow)
rename!(dependent_population, :nrow => :population)
sort!(dependent_population, [:INCOME, :SEX, :MARITIAL_STATUS, :AGE])

Row,AGE,MARITIAL_STATUS,SEX,INCOME,population
Unnamed: 0_level_1,Int64,String,Char,Int64,Int64
1,5,Divorced,F,40000,1453
2,10,Divorced,F,40000,1466
3,15,Divorced,F,40000,1475
4,20,Divorced,F,40000,1414
5,25,Divorced,F,40000,1470
6,30,Divorced,F,40000,1478
7,35,Divorced,F,40000,1408
8,40,Divorced,F,40000,1480
9,45,Divorced,F,40000,1467
10,50,Divorced,F,40000,1408


### 1.2. Compute marginal attribute distributions of the dependent population
Then we compute the marginal population attributes. This is a type of data that we can usually get from the census data and will be input for our algorithms for population generation.

In [25]:
dependent_age_sex = combine(groupby(disaggregated_dependent_population, [:AGE, :SEX], sort=true), nrow); sort!(dependent_age_sex, [:SEX, :AGE])
dependent_sex_maritial = combine(groupby(disaggregated_dependent_population, [:MARITIAL_STATUS, :SEX], sort=true), nrow); sort!(dependent_sex_maritial, [:SEX, :MARITIAL_STATUS])
dependent_income = combine(groupby(disaggregated_dependent_population, [:INCOME], sort=true), nrow)
dependent_age_sex, dependent_sex_maritial, dependent_income = map(x -> rename!(x, :nrow => :population), [dependent_age_sex, dependent_sex_maritial, dependent_income])

3-element Vector{DataFrame}:
 [1m32×3 DataFrame[0m
[1m Row [0m│[1m AGE   [0m[1m SEX  [0m[1m population [0m
     │[90m Int64 [0m[90m Char [0m[90m Int64      [0m
─────┼─────────────────────────
   1 │     5  F          18760
   2 │    10  F          18783
   3 │    15  F          18815
   4 │    20  F          18805
   5 │    25  F          18765
   6 │    30  F          18841
   7 │    35  F          18619
   8 │    40  F          18794
   9 │    45  F          18779
  10 │    50  F          18771
  11 │    55  F          18730
  ⋮  │   ⋮     ⋮        ⋮
  23 │    35  M          18590
  24 │    40  M          18823
  25 │    45  M          18927
  26 │    50  M          18905
  27 │    55  M          18773
  28 │    60  M          19025
  29 │    65  M          18566
  30 │    70  M          18476
  31 │    75  M          18736
  32 │    80  M          18718
[36m                11 rows omitted[0m
 [1m8×3 DataFrame[0m
[1m Row [0m│[1m MARITIAL_STATUS [0m[1m SEX  [

### 1.3. Generate dependent population from marginals
Then, we use our algorithm to estimate joint distribution of the marginal attributes. 

In [52]:
#dependent population
generated_dependent_population, aggregated_population = generate_joint_distributions(dependent_age_sex, dependent_sex_maritial, dependent_income)
generated_dependent_population.:population = aggregated_population.:population
generated_dependent_population[:, Not(:id)]

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 2 iterations.
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 2 iterations.
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mInconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [599996, 600000]
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 1 iterations.


Row,AGE,MARITIAL_STATUS,SEX,INCOME,population
Unnamed: 0_level_1,Int64,String,Char,Int64,Int64
1,5,Divorced,F,40000,747
2,10,Divorced,F,40000,748
3,15,Divorced,F,40000,749
4,20,Divorced,F,40000,749
5,25,Divorced,F,40000,747
6,30,Divorced,F,40000,750
7,35,Divorced,F,40000,741
8,40,Divorced,F,40000,748
9,45,Divorced,F,40000,748
10,50,Divorced,F,40000,747


### 1.4. Evaluation of fit of generated dependent population.
Finally, we evaluate if the generated population is correct. We use approach based on Z-score described by Edwards, Tanton, 2013 [1].


[1] Edwards, K. L., & Tanton, R. (2013). Validation of spatial microsimulation models. Spatial microsimulation: A reference guide for users, 249-258. https://ndl.ethernet.edu.et/bitstream/123456789/14722/1/205.pdf#page=38

In [53]:
sort!(generated_dependent_population)
sort!(dependent_population)
dependent_population.:estimated_population = generated_dependent_population.:population

p = dependent_population.:population/sum(dependent_population.:population)
t = dependent_population.:estimated_population/sum(dependent_population.:population)
N = sum(dependent_population.:population)
dependent_population.:Z_score = (p .- t) ./ sqrt.((p .* (1 .- p)) ./ N)
dependent_population

Row,AGE,MARITIAL_STATUS,SEX,INCOME,population,estimated_population,Z_score
Unnamed: 0_level_1,Int64,String,Char,Int64,Int64,Int64,Float64
1,5,Divorced,F,40000,1453,747,18.5438
2,5,Divorced,F,50000,1124,748,11.2257
3,5,Divorced,F,60000,743,749,-0.220255
4,5,Divorced,F,70000,400,749,-17.4558
5,5,Divorced,F,80000,50,747,-98.5748
6,5,Divorced,M,40000,70,750,-81.2803
7,5,Divorced,M,50000,597,741,-5.89646
8,5,Divorced,M,60000,1129,748,11.3498
9,5,Divorced,M,70000,1646,748,22.1645
10,5,Divorced,M,80000,2210,747,31.1781


In [62]:
#percentage of well-fitting values
wfv = count(i -> (-1.96<i<1.96), dependent_population.Z_score) / nrow(dependent_population)
print("Percentage of well fitting values: ", wfv, "\n")

#does the table have a good fit?
cv = sum(dependent_population.Z_score .^ 2) #much more than critical value for chi2 with 640 degrees of freedem -> bad fit
print("Table does not have good fit. Critical value is 640 and our calculated statistic has value: ", cv)

Percentage of well fitting values: 0.0984375
Table does not have good fit. Critical value is 640 and our calculated statistic has value: 3.9665000768291065e6

# 2. Population with independent variables

### 2.1. Generate target population with independent variables
First we generate target population that we will treat as the real population which we'll try to synthesize using available algorithms. All variables are independent.

In [45]:
SIZE = 600000
SEX = ['M', 'F']; SEX_WEIGHTS = [0.5, 0.5]
MARITIAL_STATUS = ["Not_married", "Married", "Divorced", "Widowed"]; MARITIAL_WEIGHTS = [0.3, 0.5, 0.1, 0.1]
AGE = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]; AGE_WEIGHTS = repeat([1 / 16], 16)
INCOME = [40000, 50000, 60000, 70000, 80000]; INCOME_WEIGHTS = [0.2, 0.2, 0.2, 0.2, 0.2]

using StatsBase
disaggregated_independent_population = DataFrame(
    AGE = sample(AGE, Weights(AGE_WEIGHTS), SIZE),
    MARITIAL_STATUS = sample(MARITIAL_STATUS, Weights(MARITIAL_WEIGHTS), SIZE),
    SEX = sample(SEX, Weights(SEX_WEIGHTS), SIZE),
    INCOME = sample(INCOME, Weights(INCOME_WEIGHTS), SIZE),
)

independent_population = combine(groupby(disaggregated_independent_population, names(disaggregated_independent_population), sort=true), nrow)
rename!(independent_population, :nrow => :population)
sort!(dependent_population, [:INCOME, :SEX, :MARITIAL_STATUS, :AGE])

Row,AGE,MARITIAL_STATUS,SEX,INCOME,population
Unnamed: 0_level_1,Int64,String,Char,Int64,Int64
1,5,Divorced,F,40000,1453
2,10,Divorced,F,40000,1466
3,15,Divorced,F,40000,1475
4,20,Divorced,F,40000,1414
5,25,Divorced,F,40000,1470
6,30,Divorced,F,40000,1478
7,35,Divorced,F,40000,1408
8,40,Divorced,F,40000,1480
9,45,Divorced,F,40000,1467
10,50,Divorced,F,40000,1408


### 2.2. Compute marginal attribute distributions of the independent population
Then we compute the marginal population attributes. This is a type of data that we can usually get from the census data and will be input for our algorithms for population generation.

In [46]:
independent_age_sex = combine(groupby(disaggregated_independent_population, [:AGE, :SEX], sort=true), nrow); sort!(independent_age_sex, [:SEX, :AGE])
independent_sex_maritial = combine(groupby(disaggregated_independent_population, [:MARITIAL_STATUS, :SEX], sort=true), nrow); sort!(independent_sex_maritial, [:SEX, :MARITIAL_STATUS])
independent_income = combine(groupby(disaggregated_independent_population, [:INCOME], sort=true), nrow);
independent_age_sex, independent_sex_maritial, independent_income = map(x -> rename!(x, :nrow => :population), [independent_age_sex, independent_sex_maritial, independent_income])

3-element Vector{DataFrame}:
 [1m32×3 DataFrame[0m
[1m Row [0m│[1m AGE   [0m[1m SEX  [0m[1m population [0m
     │[90m Int64 [0m[90m Char [0m[90m Int64      [0m
─────┼─────────────────────────
   1 │     5  F          18722
   2 │    10  F          18709
   3 │    15  F          18600
   4 │    20  F          18792
   5 │    25  F          19081
   6 │    30  F          18902
   7 │    35  F          18599
   8 │    40  F          18766
   9 │    45  F          18876
  10 │    50  F          18992
  11 │    55  F          18753
  ⋮  │   ⋮     ⋮        ⋮
  23 │    35  M          18780
  24 │    40  M          18806
  25 │    45  M          18636
  26 │    50  M          18534
  27 │    55  M          18684
  28 │    60  M          18566
  29 │    65  M          18664
  30 │    70  M          19076
  31 │    75  M          18623
  32 │    80  M          18962
[36m                11 rows omitted[0m
 [1m8×3 DataFrame[0m
[1m Row [0m│[1m MARITIAL_STATUS [0m[1m SEX  [

### 2.3. Generate dependent population from marginals
Then, we use our algorithm to estimate joint distribution of the marginal attributes. 

In [55]:
#dependent population
generated_independent_population, aggegated_indep_population = generate_joint_distributions(independent_age_sex, independent_sex_maritial, independent_income)
generated_independent_population.:population = aggegated_indep_population.:population
generated_independent_population = generated_independent_population[:, Not(:id)]

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 2 iterations.
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 2 iterations.
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mInconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [599991, 600000]
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 1 iterations.


Row,AGE,MARITIAL_STATUS,SEX,INCOME,population
Unnamed: 0_level_1,Int64,String,Char,Int64,Int64
1,5,Divorced,F,40000,377
2,10,Divorced,F,40000,377
3,15,Divorced,F,40000,375
4,20,Divorced,F,40000,379
5,25,Divorced,F,40000,384
6,30,Divorced,F,40000,381
7,35,Divorced,F,40000,375
8,40,Divorced,F,40000,378
9,45,Divorced,F,40000,380
10,50,Divorced,F,40000,383


### 2.4. Evaluation of fit of generated dependent population.
Finally, we evaluate if the generated population is correct. We use approach based on Z-score described by Edwards, Tanton, 2013 [1].


[1] Edwards, K. L., & Tanton, R. (2013). Validation of spatial microsimulation models. Spatial microsimulation: A reference guide for users, 249-258. https://ndl.ethernet.edu.et/bitstream/123456789/14722/1/205.pdf#page=38

In [56]:
sort!(generated_independent_population)
sort!(independent_population)
independent_population.:estimated_population = generated_independent_population.:population

p = independent_population.:population/sum(independent_population.:population)
t = independent_population.:estimated_population/sum(independent_population.:population)
N = sum(independent_population.:population)
independent_population.:Z_score = (p .- t) ./ sqrt.((p .* (1 .- p)) ./ N)

independent_population

Row,AGE,MARITIAL_STATUS,SEX,INCOME,population,estimated_population,Z_score
Unnamed: 0_level_1,Int64,String,Char,Int64,Int64,Int64,Float64
1,5,Divorced,F,40000,363,377,-0.735032
2,5,Divorced,F,50000,334,377,-2.35351
3,5,Divorced,F,60000,411,378,1.62833
4,5,Divorced,F,70000,393,376,0.857818
5,5,Divorced,F,80000,394,377,0.856729
6,5,Divorced,M,40000,421,379,2.04767
7,5,Divorced,M,50000,386,379,0.356405
8,5,Divorced,M,60000,371,380,-0.467401
9,5,Divorced,M,70000,375,378,-0.154968
10,5,Divorced,M,80000,361,378,-0.895006


In [61]:
#percentage of well-fitting values
wfv = count(i -> (-1.96<i<1.96), independent_population.Z_score) / nrow(independent_population)
print("Percentage of well fitting values: ", wfv, "\n")

#does the table have a good fit?
cv = sum(independent_population.Z_score .^ 2) #much more than critical value for chi2 with 640 degrees of freedem -> bad fit
print("Table does have good fit. Critical value is 640 and our calculated statistic has value: ", cv)

Percentage of well fitting values: 0.95625
Table does have good fit. Critical value is 640 and our calculated statistic has value: 587.3337345012303