In [1]:
include("../src/SyntheticPopulation.jl")

# Summary:
1. The sample-free IPF algorithm is well suited to creating joint distributions of independent attributes (about 96% of cells fit well in the run below).
2. The sample-free IPF algorithm performs poorly when creating joint distributions of highly dependent attributes (about 19% of cells fit well in the run below).
3. To overcome this, we provide a configuration file that lets the user tune the IPF algorithm. The configuration is described in a separate file.

# 1. Population with dependent variables

### 1.1. Generate target population with dependent variables
First, we generate the target population that we will try to synthesize using the available algorithms. Marital status and income are highly dependent on sex.

In [3]:
SIZE = 600000
SEX = ['M', 'F']; SEX_WEIGHTS = [0.5, 0.5]
MARITAL_STATUS = ["Not_married", "Married", "Divorced", "Widowed"]; 
MARITAL_STATUS_WEIGHTS_M = [0.1, 0.2, 0.3, 0.4]; 
MARITAL_STATUS_WEIGHTS_F = [0.4, 0.3, 0.2, 0.1];
AGE = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]; 
AGE_WEIGHTS = repeat([1 / 16], 16)
INCOME = [40000, 50000, 60000, 70000, 80000]; 
INCOME_WEIGHTS_M = [0.01, 0.1, 0.2, 0.3, 0.39];
INCOME_WEIGHTS_F = [0.39, 0.3, 0.2, 0.1, 0.01];

using StatsBase
population_m = DataFrame(
    AGE = sample(AGE, Weights(AGE_WEIGHTS),Int(SIZE / 2)),
    MARITAL_STATUS = sample(MARITAL_STATUS, Weights(MARITAL_STATUS_WEIGHTS_M), Int(SIZE / 2)),
    SEX = repeat(['M'], Int(SIZE / 2)),
    INCOME = sample(INCOME, Weights(INCOME_WEIGHTS_M), Int(SIZE / 2)),
)
population_f = DataFrame(
    AGE = sample(AGE, Weights(AGE_WEIGHTS), Int(SIZE / 2)),
    MARITAL_STATUS = sample(MARITAL_STATUS, Weights(MARITAL_STATUS_WEIGHTS_F), Int(SIZE / 2)),
    SEX = repeat(['F'], Int(SIZE / 2)),
    INCOME = sample(INCOME, Weights(INCOME_WEIGHTS_F), Int(SIZE / 2)),
)
disaggregated_dependent_population = reduce(vcat, [population_m, population_f])

dependent_population = combine(groupby(disaggregated_dependent_population, names(disaggregated_dependent_population), sort=true), nrow)
rename!(dependent_population, :nrow => :population)
sort!(dependent_population, [:INCOME, :SEX, :MARITAL_STATUS, :AGE])

| Row | AGE | MARITAL_STATUS | SEX | INCOME | population |
|---|---|---|---|---|---|
| | Int64 | String | Char | Int64 | Int64 |
| 1 | 5 | Divorced | F | 40000 | 1443 |
| 2 | 10 | Divorced | F | 40000 | 1462 |
| 3 | 15 | Divorced | F | 40000 | 1405 |
| 4 | 20 | Divorced | F | 40000 | 1416 |
| 5 | 25 | Divorced | F | 40000 | 1397 |
| 6 | 30 | Divorced | F | 40000 | 1410 |
| 7 | 35 | Divorced | F | 40000 | 1539 |
| 8 | 40 | Divorced | F | 40000 | 1541 |
| 9 | 45 | Divorced | F | 40000 | 1433 |
| 10 | 50 | Divorced | F | 40000 | 1394 |


### 1.2. Compute marginal attribute distributions of the dependent population
Next, we compute the marginal distributions of the population attributes. This is the kind of data that is usually available from a census, and it is the input to our population generation algorithms.

In [6]:
dependent_age_sex = combine(groupby(disaggregated_dependent_population, [:AGE, :SEX], sort=true), nrow); sort!(dependent_age_sex, [:SEX, :AGE])
dependent_sex_marital = combine(groupby(disaggregated_dependent_population, [:MARITAL_STATUS, :SEX], sort=true), nrow); sort!(dependent_sex_marital, [:SEX, :MARITAL_STATUS])
dependent_income = combine(groupby(disaggregated_dependent_population, [:INCOME], sort=true), nrow)
dependent_age_sex, dependent_sex_marital, dependent_income = map(x -> rename!(x, :nrow => :population), [dependent_age_sex, dependent_sex_marital, dependent_income])

3-element Vector{DataFrame}:
 32×3 DataFrame
 Row │ AGE    SEX   population
     │ Int64  Char  Int64
─────┼─────────────────────────
   1 │     5  F          18503
   2 │    10  F          18844
   3 │    15  F          18729
   4 │    20  F          18751
   5 │    25  F          18642
   6 │    30  F          18799
   7 │    35  F          18839
   8 │    40  F          18834
   9 │    45  F          18785
  10 │    50  F          18755
  11 │    55  F          18602
  ⋮  │   ⋮     ⋮        ⋮
  23 │    35  M          18976
  24 │    40  M          18706
  25 │    45  M          18492
  26 │    50  M          18861
  27 │    55  M          18457
  28 │    60  M          18736
  29 │    65  M          18926
  30 │    70  M          18758
  31 │    75  M          18840
  32 │    80  M          18727
                11 rows omitted
 8×3 DataFrame
 Row │ MARITAL_STATUS  SEX

### 1.3. Generate dependent population from marginals
Then we use our algorithm to estimate the joint distribution of the attributes from the marginals.
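
To make the idea concrete, here is a minimal two-dimensional IPF sketch in plain Julia. It is not the implementation behind `generate_joint_distribution`; the function `ipf_2d` and the variables `sex_totals`, `income_totals` and `seed` are hypothetical names, and the totals are illustrative round numbers. With a uniform seed (the sample-free case) the fitted table carries no association between the two attributes.

```julia
# Minimal two-dimensional IPF: rescale rows, then columns, until the margins match.
# Hypothetical helper for illustration only, not part of SyntheticPopulation.jl.
function ipf_2d(seed::Matrix{Float64}, row_targets::Vector{Float64},
                col_targets::Vector{Float64}; tol = 1e-10, maxiter = 100)
    table = copy(seed)
    for _ in 1:maxiter
        # Scale each row so that the row sums equal row_targets.
        table .*= row_targets ./ vec(sum(table, dims = 2))
        # Scale each column so that the column sums equal col_targets.
        table .*= (col_targets ./ vec(sum(table, dims = 1)))'
        # Column sums are exact after the column step, so only the rows need checking.
        maximum(abs.(vec(sum(table, dims = 2)) .- row_targets)) < tol && break
    end
    return table
end

sex_totals = [300_000.0, 300_000.0]   # illustrative totals for F and M
income_totals = fill(120_000.0, 5)    # illustrative totals for five income bands
seed = ones(2, 5)                     # uniform seed: no sample data available
joint = ipf_2d(seed, sex_totals, income_totals)
```

Because the seed is uniform, every cell of `joint` ends up equal to the product of its row and column totals divided by the grand total, which is exactly the independence assumption examined in the evaluation sections.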

In [7]:
#dependent population
generated_dependent_aggregated_population = generate_joint_distribution(dependent_age_sex, dependent_sex_marital, dependent_income)
generated_dependent_aggregated_population = generated_dependent_aggregated_population[:, Not(:id)]

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 2 iterations.
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 2 iterations.
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mInconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [599997, 600000]
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 1 iterations.


| Row | AGE | MARITAL_STATUS | SEX | INCOME | population |
|---|---|---|---|---|---|
| | Int64 | String | Char | Int64 | Int64 |
| 1 | 5 | Divorced | F | 40000 | 738 |
| 2 | 10 | Divorced | F | 40000 | 752 |
| 3 | 15 | Divorced | F | 40000 | 747 |
| 4 | 20 | Divorced | F | 40000 | 748 |
| 5 | 25 | Divorced | F | 40000 | 744 |
| 6 | 30 | Divorced | F | 40000 | 750 |
| 7 | 35 | Divorced | F | 40000 | 752 |
| 8 | 40 | Divorced | F | 40000 | 752 |
| 9 | 45 | Divorced | F | 40000 | 750 |
| 10 | 50 | Divorced | F | 40000 | 748 |


### 1.4. Evaluation of fit of generated dependent population.
Finally, we evaluate how well the generated population matches the target. We use the Z-score approach described by [Williamson, 2013] [1].


[1] Williamson, P. (2013). An evaluation of two synthetic small-area microdata simulation methodologies: Synthetic reconstruction and combinatorial optimisation. Spatial microsimulation: A reference guide for users, 19-47. https://ndl.ethernet.edu.et/bitstream/123456789/14722/1/205.pdf#page=38
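
The statistic computed in the next cell can be written out as follows. The symbols $n_i$ (observed count in cell $i$) and $\hat{n}_i$ (estimated count in cell $i$) are notation introduced here for illustration; $p$, $t$ and $N$ match the variable names in the code.

$$
p_i = \frac{n_i}{N}, \qquad t_i = \frac{\hat{n}_i}{N}, \qquad Z_i = \frac{p_i - t_i}{\sqrt{p_i (1 - p_i) / N}} = \frac{n_i - \hat{n}_i}{\sqrt{n_i (1 - n_i / N)}}
$$

For example, a cell with $n_i = 1443$, $\hat{n}_i = 738$ and $N = 600000$ gives

$$
Z_i = \frac{1443 - 738}{\sqrt{1443 \cdot (1 - 1443 / 600000)}} \approx 18.58 .
$$

A cell is treated as well fitting when $|Z_i| < 1.96$, and the overall fit is judged from $\sum_i Z_i^2$.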

In [8]:
sort!(generated_dependent_aggregated_population)
sort!(dependent_population)
dependent_population.:estimated_population = generated_dependent_aggregated_population.:population

p = dependent_population.:population/sum(dependent_population.:population)
t = dependent_population.:estimated_population/sum(dependent_population.:population)
N = sum(dependent_population.:population)
dependent_population.:Z_score = (p .- t) ./ sqrt.((p .* (1 .- p)) ./ N)
dependent_population

| Row | AGE | MARITAL_STATUS | SEX | INCOME | population | estimated_population | Z_score |
|---|---|---|---|---|---|---|---|
| | Int64 | String | Char | Int64 | Int64 | Int64 | Float64 |
| 1 | 5 | Divorced | F | 40000 | 1443 | 738 | 18.5814 |
| 2 | 5 | Divorced | F | 50000 | 1088 | 739 | 10.5902 |
| 3 | 5 | Divorced | F | 60000 | 744 | 740 | 0.146738 |
| 4 | 5 | Divorced | F | 70000 | 373 | 741 | -19.0602 |
| 5 | 5 | Divorced | F | 80000 | 27 | 739 | -137.028 |
| 6 | 5 | Divorced | M | 40000 | 47 | 1114 | -155.644 |
| 7 | 5 | Divorced | M | 50000 | 572 | 1116 | -22.7566 |
| 8 | 5 | Divorced | M | 60000 | 1111 | 1117 | -0.180176 |
| 9 | 5 | Divorced | M | 70000 | 1687 | 1118 | 13.8729 |
| 10 | 5 | Divorced | M | 80000 | 2167 | 1116 | 22.6182 |


In [10]:
#percentage of well-fitting values
wfv = count(i -> (-1.96<i<1.96), dependent_population.Z_score) / nrow(dependent_population)
print("Percentage of well fitting values: ", wfv, "\n")

#does the table have a good fit?
cv = sum(dependent_population.Z_score .^ 2) # sum of squared Z-scores; far above the critical value for chi2 with 640 degrees of freedom -> bad fit
print("Table does not have good fit. Critical value is 640 and our calculated statistic has value: ", cv)

Percentage of well fitting values: 0.1890625
Table does not have good fit. Critical value is 640 and our calculated statistic has value: 2.382529822692407e6
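
The poor fit follows directly from the inputs: income is supplied only as a one-way marginal, so its dependence on sex is not available to the algorithm. The sketch below recomputes the expected count for one cell (age 5, divorced, female, income 40000) from the weights defined in section 1.1, first from the generating process and then from the marginals alone. The variable names are introduced here for illustration.

```julia
N = 600_000
p_female = 0.5
p_age = 1 / 16          # each of the 16 age bands is equally likely
p_divorced_f = 0.2      # MARITAL_STATUS_WEIGHTS_F entry for "Divorced"
p_income_f = 0.39       # INCOME_WEIGHTS_F entry for 40000
p_income_m = 0.01       # INCOME_WEIGHTS_M entry for 40000

# Expected count under the generating process: income is drawn conditionally on sex.
expected_true = N * p_female * p_age * p_divorced_f * p_income_f

# The income marginal pools both sexes, so the sex-specific signal is lost.
p_income_marginal = 0.5 * p_income_f + 0.5 * p_income_m
expected_marginal = N * p_female * p_age * p_divorced_f * p_income_marginal

println("expected from the generating process: ", expected_true)
println("expected from marginals only:         ", expected_marginal)
```

The two figures come out at roughly 1462 and 750, in line with the observed 1443 and the estimated 738 in the first row of the evaluation table.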

# 2. Population with independent variables

### 2.1. Generate target population with independent variables
First, we generate the target population that we will try to synthesize using the available algorithms. All variables are independent.

In [13]:
SIZE = 600000
SEX = ['M', 'F']; SEX_WEIGHTS = [0.5, 0.5]
MARITAL_STATUS = ["Not_married", "Married", "Divorced", "Widowed"]; MARITAL_WEIGHTS = [0.3, 0.5, 0.1, 0.1]
AGE = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]; AGE_WEIGHTS = repeat([1 / 16], 16)
INCOME = [40000, 50000, 60000, 70000, 80000]; INCOME_WEIGHTS = [0.2, 0.2, 0.2, 0.2, 0.2]

using StatsBase
disaggregated_independent_population = DataFrame(
    AGE = sample(AGE, Weights(AGE_WEIGHTS), SIZE),
    MARITAL_STATUS = sample(MARITAL_STATUS, Weights(MARITAL_WEIGHTS), SIZE),
    SEX = sample(SEX, Weights(SEX_WEIGHTS), SIZE),
    INCOME = sample(INCOME, Weights(INCOME_WEIGHTS), SIZE),
)

independent_population = combine(groupby(disaggregated_independent_population, names(disaggregated_independent_population), sort=true), nrow)
rename!(independent_population, :nrow => :population)
sort!(independent_population, [:INCOME, :SEX, :MARITAL_STATUS, :AGE])


### 2.2. Compute marginal attribute distributions of the independent population
Next, we compute the marginal distributions of the population attributes. This is the kind of data that is usually available from a census, and it is the input to our population generation algorithms.

In [15]:
independent_age_sex = combine(groupby(disaggregated_independent_population, [:AGE, :SEX], sort=true), nrow); sort!(independent_age_sex, [:SEX, :AGE])
independent_sex_marital = combine(groupby(disaggregated_independent_population, [:MARITAL_STATUS, :SEX], sort=true), nrow); sort!(independent_sex_marital, [:SEX, :MARITAL_STATUS])
independent_income = combine(groupby(disaggregated_independent_population, [:INCOME], sort=true), nrow);
independent_age_sex, independent_sex_marital, independent_income = map(x -> rename!(x, :nrow => :population), [independent_age_sex, independent_sex_marital, independent_income])

3-element Vector{DataFrame}:
 32×3 DataFrame
 Row │ AGE    SEX   population
     │ Int64  Char  Int64
─────┼─────────────────────────
   1 │     5  F          19029
   2 │    10  F          18865
   3 │    15  F          18997
   4 │    20  F          18662
   5 │    25  F          18735
   6 │    30  F          18712
   7 │    35  F          18878
   8 │    40  F          18555
   9 │    45  F          18819
  10 │    50  F          18570
  11 │    55  F          18636
  ⋮  │   ⋮     ⋮        ⋮
  23 │    35  M          18797
  24 │    40  M          18724
  25 │    45  M          18704
  26 │    50  M          18827
  27 │    55  M          18683
  28 │    60  M          18795
  29 │    65  M          18777
  30 │    70  M          18744
  31 │    75  M          18773
  32 │    80  M          18663
                11 rows omitted
 8×3 DataFrame
 Row │ MARITAL_STATUS  SEX

### 2.3. Generate independent population from marginals
Then we use our algorithm to estimate the joint distribution of the attributes from the marginals.

In [16]:
# independent population
generated_aggregated_indep_population = generate_joint_distribution(independent_age_sex, independent_sex_marital, independent_income)
generated_aggregated_indep_population = generated_aggregated_indep_population[:, Not(:id)]

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 2 iterations.
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 2 iterations.
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 2 iterations.


| Row | AGE | MARITAL_STATUS | SEX | INCOME | population |
|---|---|---|---|---|---|
| | Int64 | String | Char | Int64 | Int64 |
| 1 | 5 | Divorced | F | 40000 | 379 |
| 2 | 10 | Divorced | F | 40000 | 376 |
| 3 | 15 | Divorced | F | 40000 | 379 |
| 4 | 20 | Divorced | F | 40000 | 372 |
| 5 | 25 | Divorced | F | 40000 | 373 |
| 6 | 30 | Divorced | F | 40000 | 373 |
| 7 | 35 | Divorced | F | 40000 | 376 |
| 8 | 40 | Divorced | F | 40000 | 370 |
| 9 | 45 | Divorced | F | 40000 | 375 |
| 10 | 50 | Divorced | F | 40000 | 370 |


### 2.4. Evaluation of fit of generated independent population.
Finally, we evaluate how well the generated population matches the target. We use the Z-score approach described by [Williamson, 2013] [1].



In [17]:
sort!(generated_aggregated_indep_population)
sort!(independent_population)
independent_population.:estimated_population = generated_aggregated_indep_population.:population

p = independent_population.:population/sum(independent_population.:population)
t = independent_population.:estimated_population/sum(independent_population.:population)
N = sum(independent_population.:population)
independent_population.:Z_score = (p .- t) ./ sqrt.((p .* (1 .- p)) ./ N)

independent_population

| Row | AGE | MARITAL_STATUS | SEX | INCOME | population | estimated_population | Z_score |
|---|---|---|---|---|---|---|---|
| | Int64 | String | Char | Int64 | Int64 | Int64 | Float64 |
| 1 | 5 | Divorced | F | 40000 | 423 | 379 | 2.14011 |
| 2 | 5 | Divorced | F | 50000 | 379 | 377 | 0.102766 |
| 3 | 5 | Divorced | F | 60000 | 382 | 379 | 0.153542 |
| 4 | 5 | Divorced | F | 70000 | 352 | 378 | -1.38621 |
| 5 | 5 | Divorced | F | 80000 | 393 | 378 | 0.756898 |
| 6 | 5 | Divorced | M | 40000 | 374 | 370 | 0.2069 |
| 7 | 5 | Divorced | M | 50000 | 359 | 368 | -0.475144 |
| 8 | 5 | Divorced | M | 60000 | 373 | 370 | 0.155382 |
| 9 | 5 | Divorced | M | 70000 | 359 | 368 | -0.475144 |
| 10 | 5 | Divorced | M | 80000 | 384 | 369 | 0.765711 |


In [18]:
#percentage of well-fitting values
wfv = count(i -> (-1.96<i<1.96), independent_population.Z_score) / nrow(independent_population)
print("Percentage of well fitting values: ", wfv, "\n")

#does the table have a good fit?
cv = sum(independent_population.Z_score .^ 2) # sum of squared Z-scores; below the critical value for chi2 with 640 degrees of freedom -> good fit
print("Table does have good fit. Critical value is 640 and our calculated statistic has value: ", cv)

Percentage of well fitting values: 0.959375
Table does have good fit. Critical value is 640 and our calculated statistic has value: 582.668278043634
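
The threshold of 640 used in both checks is the number of cells, that is, the degrees of freedom. The 5% critical value of a chi-square distribution with 640 degrees of freedom is somewhat larger. The sketch below estimates it with the Wilson-Hilferty approximation using only Base Julia; `chisq_critical_wh` is a hypothetical helper, and the default `z` is the standard normal quantile for the upper 5% tail. Both conclusions in this notebook are unchanged under the larger threshold.

```julia
# Wilson-Hilferty approximation to the upper chi-square quantile.
# z is the matching standard normal quantile (about 1.6449 for the upper 5% tail).
function chisq_critical_wh(df::Integer; z::Float64 = 1.6449)
    h = 2 / (9 * df)
    return df * (1 - h + z * sqrt(h))^3
end

critical = chisq_critical_wh(640)
println("approximate 5% critical value for 640 degrees of freedom: ", round(critical, digits = 1))

# Statistics reported in this notebook for the two populations.
for (label, stat) in [("dependent", 2.382529822692407e6), ("independent", 582.668278043634)]
    println(label, ": ", stat > critical ? "rejects good fit" : "consistent with good fit")
end
```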