## Data generation

#### Include needed files

In [1]:
include("src/SyntheticPopulation.jl")

#### Input data
The input DataFrames are created manually from real census data.

In [8]:
# each simulated individual or household represents 1000 real individuals or households
SCALE = 0.001 

#all values are based on China census data
popoulation_size = 21890000


marginal_ind_age_sex = DataFrame(
    sex = repeat(['M', 'F'], 18),
    age = repeat(2:5:87, inner = 2), 
    population = SCALE .* 10000 .* [52.6, 49.0, 48.5, 44.8, 33.6, 30.6, 34.6, 28.8, 71.6, 63.4, 99.6, 90.9, 130.9, 119.4, 110.8, 103.5, 83.8, 76.4, 84.2, 77.7, 84.2, 77.8, 82.8, 79.9, 67.7, 71.0, 56.9, 62.6, 31.5, 35.3, 18.5, 23.0, 15.2, 19.7, 12.5, 16.0]
    )


marginal_ind_sex_maritialstatus = DataFrame(
    sex = repeat(['M', 'F'], 4), 
    maritialstatus = repeat(["Not_married", "Married", "Divorced", "Widowed"], inner = 2), 
    population = SCALE .* [1679, 1611, 5859, 5774, 140, 206, 128, 426] ./ 0.00082
    )


marginal_ind_income = DataFrame(
    income = [25394, 44855, 63969, 88026, 145915], 
    population = repeat([popoulation_size * SCALE / 5], 5)
    )


marginal_hh_size = DataFrame(
    hh_size = [1,2,3,4,5],
    population = Int.(round.(SCALE * 8230000 .* [0.299, 0.331, 0.217, 0.09, 0.063]))
    )

nothing
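The marginal tables come from different census tables, so their totals need not agree. As a quick sanity check, the sketch below recomputes the scaled totals with plain vectors copied from the cell above (no DataFrames needed). The name `margin_totals` is introduced here only for illustration.

```julia
SCALE = 0.001

# Values copied from the input cell above.
age_sex = SCALE .* 10000 .* [52.6, 49.0, 48.5, 44.8, 33.6, 30.6, 34.6, 28.8, 71.6, 63.4,
                             99.6, 90.9, 130.9, 119.4, 110.8, 103.5, 83.8, 76.4, 84.2, 77.7,
                             84.2, 77.8, 82.8, 79.9, 67.7, 71.0, 56.9, 62.6, 31.5, 35.3,
                             18.5, 23.0, 15.2, 19.7, 12.5, 16.0]
marital = SCALE .* [1679, 1611, 5859, 5774, 140, 206, 128, 426] ./ 0.00082
income  = fill(21890000 * SCALE / 5, 5)
hh_size = Int.(round.(SCALE * 8230000 .* [0.299, 0.331, 0.217, 0.09, 0.063]))

# Total scaled population implied by each marginal table.
margin_totals = (age_sex = sum(age_sex), marital = sum(marital), income = sum(income))
println(margin_totals)
println("households: ", sum(hh_size))
```

The age-by-sex and income totals are close to the scaled population size, while the marital-status total is lower, so the margins have to be reconciled before a joint distribution can be built from them.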

Generate areas dataframes

In [21]:
#areas
URL = "https://osm-boundaries.com/Download/Submit?apiKey=87100809b4085adb58139419c141e5a1&db=osm20230102&osmIds=-2988894,-2988933,-2988895,-288600,-2988896,-2988946,-5505984,-2988897,-2988898,-2988899,-2988900,-5505985,-2988901,-2988902,-568660,-2988903&format=GeoJSON&srid=4326"
areas = generate_areas_dataframe(URL)

#aggregated_areas - population referenced from https://nj.tjj.beijing.gov.cn/nj/main/2021-tjnj/zk/indexeh.htm
aggregated_areas = copy(areas)
aggregated_areas.:population = Int.(round.(SCALE .* 10000 .* [56.8, 313.2, 201.9, 345.1, 34.6, 184.0, 132.4, 45.7, 52.8, 39.3, 44.1, 131.3, 199.4, 226.9, 110.6, 70.9]))
aggregated_areas = aggregated_areas[:, [:id, :population]]
nothing

Downloading file... 
File downloaded. Unzipping file...
File saved at /Users/marcinzurek/Desktop/Studia/Research/SyntheticPopulation/file.geojson


Generate individuals and household dataframes

In [10]:
individuals, aggregated_individuals = generate_joint_distributions(marginal_ind_sex_maritialstatus, marginal_ind_age_sex, marginal_ind_income, config_file = "config_file.json")
households, aggregated_households = generate_joint_distributions(marginal_hh_size)
nothing

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mInconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [9520, 9502]
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 1 iterations.
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mInconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [9777, 9166]
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 1 iterations.
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mInconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [18668, 21890]
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 1 iterations.


Allocate individuals

In [11]:
aggregated_individuals, disaggregated_households, aggregated_households, unassigned = assign_individuals_to_households!(individuals, aggregated_individuals, households, aggregated_households, return_unassigned = true)
nothing

Total_number of individuals: 21885
Total_number of households: 8230
Allocation started... 


Progress: 100%|█████████████████████████████████████████| Time: 0:00:04


Allocated 86.0% individuals.
Allocated 100.0% households.


Allocate areas

In [12]:
disaggregated_households, unassigned = assign_areas_to_households!(disaggregated_households, areas, aggregated_areas, return_unassigned = true)
nothing

Assigning coordinates to households...


Progress: 100%|█████████████████████████████████████████| Time: 0:00:29


## Input and output data format

### Naming convention:
- `dataframe` - unique combinations of attributes, with an `:id` column and one column per attribute.
- `aggregated_dataframe` - aggregated population, with columns `:id` and `:population`.
- `disaggregated_dataframe` - disaggregated population, one row per unit (e.g. one household).
- `unassigned_dataframe` - aggregated population, with columns `:id` and `:unassigned_population`.
- `marginal_attr1_attr2` - flattened contingency table with the marginal distribution of the listed attributes.
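
The difference between the aggregated and disaggregated forms can be illustrated with plain vectors. This is only a sketch of the idea: the helper names `disaggregate` and `aggregate` are hypothetical and are not part of the package.

```julia
# Aggregated form: one entry per attribute combination (id) plus a count.
ids    = [1, 2, 3]
counts = [2, 0, 1]

# Disaggregated form: one entry per unit, carrying the id of its combination.
disaggregate(ids, counts) = reduce(vcat, fill.(ids, counts); init = Int[])

# Back to the aggregated form: count how many units carry each id.
aggregate(ids, units) = [count(==(id), units) for id in ids]

units = disaggregate(ids, counts)
println(units)
println(aggregate(ids, units))
```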

### Dataframes:

`marginal_attr1_attr2`
___________

Description:
- flattened multi-way contingency table with marginal distributions of the variables
- example: `marginal_age_sex`

Columns:
- `:attr1` - 1st attribute of the contingency table
- `:attr2`- 2nd attribute of the contingency table
- `:attr3`, `:attr4` - optional attributes
- `:population` - total population or total number of elements for a given combination of attributes

Returned by:
- manual input based on different data sources (e.g. census)

Mutated by:
- N/A

Used as argument by:
- `generate_joint_distributions`
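
To make "flattened contingency table" concrete, the sketch below takes the first three age groups of the age-by-sex input (populations after scaling) and reshapes the flat `population` vector back into a two-way table. It uses plain arrays instead of a DataFrame, and the variable names are chosen here for illustration.

```julia
# Flattened layout: one row per (sex, age) combination, sex varying fastest.
sex        = repeat(['M', 'F'], 3)
age        = repeat(2:5:12, inner = 2)
population = [526, 490, 485, 448, 336, 306]

# Because sex varies fastest, a column-major reshape recovers the two-way table:
# rows are sex (M, F), columns are age groups (2, 7, 12).
table = reshape(population, 2, 3)

by_sex = vec(sum(table, dims = 2))   # marginal distribution over sex
by_age = vec(sum(table, dims = 1))   # marginal distribution over age
println(by_sex)
println(by_age)
```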

`disaggregated_households`
_____

Description:
- The final DataFrame that represents the estimated population. Each row is one household with its characteristics, such as the assigned individuals and the coordinates of the household's location.

Columns:
- `:id` - ID of a household
- `:hh_attr_id` - `:hh_attr_id` from DataFrame `households` that references a certain household type.
- `:head_id` - `:ind_attr_id` from DataFrame `individuals` that references head of household
- `:partner_id` - `:ind_attr_id` from DataFrame `individuals` that references partner of selected head, ==0 if not applicable
- `:child1_id` - `:ind_attr_id` from DataFrame `individuals` that references child in household, ==0 if not applicable
- `:child2_id` - `:ind_attr_id` from DataFrame `individuals` that references child in household, ==0 if not applicable
- `:child3_id` - `:ind_attr_id` from DataFrame `individuals` that references child in household, ==0 if not applicable
- `:lat` - latitude of the household location, ==0 if not assigned
- `:lon` - longitude of the household location, ==0 if not assigned
- `:area_id` - `:area_id` that references an area from DataFrame `areas`, ==0 if not assigned

Returned by:
- `assign_individuals_to_households!`

Mutated by:
- `assign_areas_to_households!`

Used as argument by:
- `assign_areas_to_households!`

In [13]:
disaggregated_households

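| Row | id | hh_size | available | head_id | partner_id | child1_id | child2_id | child3_id | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Bool | Int64 | Int64 | Int64 | Int64 | Int64 | Float64 | Float64 |
| 1 | 1 | 1 | false | 516 | 0 | 0 | 0 | 0 | 39.8369 | 116.074 |
| 2 | 2 | 1 | false | 73 | 0 | 0 | 0 | 0 | 39.6054 | 116.236 |
| 3 | 3 | 1 | false | 201 | 0 | 0 | 0 | 0 | 39.902 | 116.41 |
| 4 | 4 | 1 | false | 125 | 0 | 0 | 0 | 0 | 39.87 | 116.249 |
| 5 | 5 | 1 | false | 463 | 0 | 0 | 0 | 0 | 39.8772 | 116.607 |
| 6 | 6 | 1 | false | 65 | 0 | 0 | 0 | 0 | 39.8556 | 116.177 |
| 7 | 7 | 1 | false | 347 | 0 | 0 | 0 | 0 | 39.757 | 115.783 |
| 8 | 8 | 1 | false | 133 | 0 | 0 | 0 | 0 | 39.9959 | 116.385 |
| 9 | 9 | 1 | false | 189 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 10 | 10 | 1 | false | 193 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |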


`individuals`
______

Description:
- Each row represents a unique combination of attributes of individuals

Columns:
- `:id` - ID that represents a unique combination of attributes of individuals
- different columns that represent the attributes of the individual, in our example these are:
    - `:maritialstatus`
    - `:age`
    - `:sex`
    - `:income`

Returned by:
- `generate_joint_distributions`
- can also be generated manually

Mutated by:
- N/A

Used as argument by:
- `assign_individuals_to_households!`

In [14]:
individuals

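| Row | id | maritialstatus | sex | age | income | population |
|---|---|---|---|---|---|---|
| | Int64 | String? | Char? | Int64? | Int64? | Int64 |
| 1 | 1 | Not_married | M | 22 | 25394 | 31 |
| 2 | 2 | Married | M | 22 | 25394 | 107 |
| 3 | 3 | Divorced | M | 22 | 25394 | 3 |
| 4 | 4 | Widowed | M | 22 | 25394 | 2 |
| 5 | 5 | Not_married | F | 22 | 25394 | 43 |
| 6 | 6 | Married | F | 22 | 25394 | 150 |
| 7 | 7 | Divorced | F | 22 | 25394 | 4 |
| 8 | 8 | Widowed | F | 22 | 25394 | 3 |
| 9 | 9 | Not_married | M | 27 | 25394 | 56 |
| 10 | 10 | Married | M | 27 | 25394 | 196 |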


`aggregated_individuals`
________

Description:
- Each row gives the number of individuals with the unique combination of attributes identified by `:id`.

Columns:
- `:id` - `:id` from DataFrame `individuals` that references a unique combination of attributes of individuals
- `:population` - estimated number of individuals with the combination of attributes identified by `:id`

Returned by:
- `generate_joint_distributions`

Mutated by:
- `assign_individuals_to_households!`

Used as argument by:
- `assign_individuals_to_households!`

In [19]:
aggregated_individuals

| Row | id | population |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 2 |
| 2 | 2 | 5 |
| 3 | 3 | 0 |
| 4 | 4 | 0 |
| 5 | 5 | 0 |
| 6 | 6 | 9 |
| 7 | 7 | 0 |
| 8 | 8 | 0 |
| 9 | 9 | 22 |
| 10 | 10 | 8 |


`areas`
______

Description:
- Each row represents a unique area and its characteristics

Columns:
- `:id` - ID of an area
- additional data representing each area. In our case:
    - `:geometry` - `MultiPolygon` or `Polygon` object that represents the territory of a given area
    - `:name` - name of the area

Returned by: 
- function `generate_areas_dataframe` given input data downloaded from https://osm-boundaries.com/
- can also be generated manually, depending on the data source

Mutated by:
- N/A

Used as argument by:
- `assign_areas_to_households!`

In [15]:
areas

| Row | id | geometry | name |
|---|---|---|---|
| | Int64 | Geometry | String |
| 1 | 1 | Polygon | Shijingshan District |
| 2 | 2 | Polygon | Haidian District |
| 3 | 3 | Polygon | Fengtai District |
| 4 | 4 | MultiPolygon | Chaoyang District |
| 5 | 5 | Polygon | Yanqing District |
| 6 | 6 | Polygon | Tongzhou District |
| 7 | 7 | Polygon | Shunyi District |
| 8 | 8 | Polygon | Pinggu District |
| 9 | 9 | Polygon | Miyun District |
| 10 | 10 | Polygon | Mentougou District |


`aggregated_areas`
________________

Description:
- each row represents population for a given area

Columns:
- `:id` - `:id` from DataFrame `areas` that references a unique area
- `:population` - population for a given area

Returned by:
- it is generated manually given DataFrame `areas` and additional input data (e.g. from census)

Mutated by:
- N/A

Used as argument by:
- `assign_areas_to_households!`

In [20]:
aggregated_areas

| Row | id | population |
|---|---|---|
| | Int64 | Float64 |
| 1 | 1 | 568.0 |
| 2 | 2 | 3132.0 |
| 3 | 3 | 2019.0 |
| 4 | 4 | 3451.0 |
| 5 | 5 | 346.0 |
| 6 | 6 | 1840.0 |
| 7 | 7 | 1324.0 |
| 8 | 8 | 457.0 |
| 9 | 9 | 528.0 |
| 10 | 10 | 393.0 |


`households`
_____________

Description:
- Each row represents a unique combination of attributes of household

Columns:
- `:hh_attr_id` - ID that represents a unique combination of attributes of households
- different columns that represent the attributes of the household, in our example these are:
    - `:hh_size`

Returned by:
- `generate_joint_distributions`
- can also be generated manually (like in our case)

Mutated by:
- N/A

In [133]:
households

| Row | id | hh_size | population |
|---|---|---|---|
| | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 2461 |
| 2 | 2 | 2 | 2724 |
| 3 | 3 | 3 | 1786 |
| 4 | 4 | 4 | 741 |
| 5 | 5 | 5 | 518 |


`aggregated_households`
___________________

Description:
- Each row gives the number of households with the unique combination of attributes identified by `:hh_attr_id`.

Columns:
- `:hh_attr_id` - `:hh_attr_id` from DataFrame `households` that references a certain household type
- `:quantity` - number of households with combination of attributes described by `:hh_attr_id`

Returned by:
- `generate_joint_distributions`
- can also be generated manually (like in our case)

Mutated by:
- `assign_individuals_to_households!`

Used as argument by:
- `assign_individuals_to_households!`

In [134]:
aggregated_households

| Row | id | population |
|---|---|---|
| | Int64 | Int64 |
| 1 | 2 | 0 |


### Optional output dataframes

`unassigned_individuals`
___________

Description: 
- Mutation of `aggregated_individuals`. Represents individuals that were not assigned to any household.

Columns: 
- `:ind_attr_id` - `:ind_attr_id` from DataFrame `individuals` that represents a unique combination of attributes of individuals
- `:population` - number of individuals with the combination of attributes `:ind_attr_id` that were not assigned to any household

Returned by:
- `assign_individuals_to_households!` if optional parameter `return_unassigned` is set to `true`

Mutated by:
- N/A

Used as argument by:
- N/A

In [22]:
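# NOTE: `unassigned` was most likely overwritten by the later call to `assign_areas_to_households!`,
# so the entries returned by `assign_individuals_to_households!` are no longer available here
# (the key below is also misspelled).
unassigned["unasssigned_individuals"]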

LoadError: KeyError: key "unasssigned_individuals" not found

`unassigned_households`
___________

Description: 
- Mutation of `aggregated_households`. Represents households to which no individuals were allocated.

Columns: 
- `:hh_attr_id` - `:hh_attr_id` from DataFrame `households` that represents a unique combination of attributes of a household
- `:population` - number of households with combination of attributes `hh_attr_id` with no individuals assigned

Returned by:
- `assign_individuals_to_households!` if optional parameter `return_unassigned` is set to `true`

Mutated by:
- N/A

Used as argument by:
- N/A

In [24]:
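# NOTE: `unassigned` was most likely overwritten by the later call to `assign_areas_to_households!`,
# so this key is no longer present.
unassigned["unassigned_households"]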

LoadError: KeyError: key "unassigned_households" not found

`unassigned_areas`
____________

Description: 
- Mutation of `aggregated_areas`. Represents areas for which the estimated population does not match the target population.
- Examples: 
    - if the target population for an area is 100, but the total number of individuals in households in this area is 98, the `:population` value is 2
    - if the target population for an area is 100, but the total number of individuals in households in this area is 105, the `:population` value is -5

Columns: 
- `:area_id` - `:area_id` from DataFrame `areas` that represents the areas for which locations of households are estimated
- `:population` - difference between the target population of the area and the total number of individuals in the households assigned to it (positive if the area is under-filled, negative if over-filled)

Returned by:
- `assign_areas_to_households!` if optional parameter `return_unassigned` is set to `true`

Mutated by:
- N/A

Used as argument by:
- N/A
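
The two examples in the description amount to a simple subtraction per area. A minimal sketch, assuming two hypothetical areas with ids 1 and 2 and using plain dictionaries instead of DataFrames:

```julia
# Target population per area and the population actually placed there
# (hypothetical area ids, numbers taken from the examples above).
target    = Dict(1 => 100, 2 => 100)
estimated = Dict(1 => 98,  2 => 105)

# Positive value: the area is still missing people; negative value: it is over-filled.
remaining = Dict(id => target[id] - estimated[id] for id in keys(target))
println(remaining)
```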

In [26]:
unassigned["unassigned_areas"] #in our case all areas were assigned so dataframe is empty

| Row | id | geometry | name | population | target_population | max_lon | min_lon | max_lat | min_lat |
|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Geometry | String | Float64 | Int64 | Float64 | Float64 | Float64 | Float64 |


`disaggregated_unassigned_households`
________

Description:
- Represents households to which no coordinates were assigned.
- Filtered rows of `disaggregated_households` for which `:lat` and `:lon` are equal to 0 after `assign_areas_to_households!` has been run.

Columns:
- same as in `disaggregated_households`

Returned by:
- `assign_areas_to_households!` if optional parameter `return_unassigned` is set to `true`

Mutated by:
- N/A

Used as argument by:
- N/A
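
The filter described above can be sketched with a vector of named tuples standing in for the DataFrame. The three rows are taken from the `disaggregated_households` output shown earlier, reduced to the relevant columns, and the variable names are chosen here for illustration.

```julia
# Stand-in for a few rows of `disaggregated_households` (only the relevant columns).
households_sample = [
    (id = 8,  lat = 39.9959, lon = 116.385),
    (id = 9,  lat = 0.0,     lon = 0.0),
    (id = 10, lat = 0.0,     lon = 0.0),
]

# Households without coordinates are those whose lat and lon are both still 0.
without_coordinates = filter(h -> h.lat == 0 && h.lon == 0, households_sample)
println([h.id for h in without_coordinates])
```

With DataFrames.jl loaded, the same predicate can be passed to `filter` with the DataFrame as second argument.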

In [27]:
unassigned["disaggregated_unassigned_households"]

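| Row | id | hh_size | available | head_id | partner_id | child1_id | child2_id | child3_id | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Bool | Int64 | Int64 | Int64 | Int64 | Int64 | Float64 | Float64 |
| 1 | 1 | 1 | false | 516 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 2 | 2 | 1 | false | 73 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 3 | 3 | 1 | false | 201 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 4 | 4 | 1 | false | 125 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 5 | 5 | 1 | false | 463 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 6 | 6 | 1 | false | 65 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 7 | 7 | 1 | false | 347 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 8 | 8 | 1 | false | 133 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 9 | 9 | 1 | false | 189 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 10 | 10 | 1 | false | 193 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |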
