## Data generation

#### Include needed files
The data and functions are not explained here; a detailed explanation is provided in `demo.ipynb`. This notebook only executes the code so that the resulting tables can be described in the sections below.

In [13]:
include("src/SyntheticPopulation.jl")

SCALE = 0.001 
popoulation_size = 21890000

marginal_ind_age_sex = DataFrame(
    sex = repeat(['M', 'F'], 18),
    age = repeat(2:5:87, inner = 2), 
    population = SCALE .* 10000 .* [52.6, 49.0, 48.5, 44.8, 33.6, 30.6, 34.6, 28.8, 71.6, 63.4, 99.6, 90.9, 130.9, 119.4, 110.8, 103.5, 83.8, 76.4, 84.2, 77.7, 84.2, 77.8, 82.8, 79.9, 67.7, 71.0, 56.9, 62.6, 31.5, 35.3, 18.5, 23.0, 15.2, 19.7, 12.5, 16.0]
    )
marginal_ind_sex_maritialstatus = DataFrame(
    sex = repeat(['M', 'F'], 4), 
    maritialstatus = repeat(["Not_married", "Married", "Divorced", "Widowed"], inner = 2), 
    population = SCALE .* [1679, 1611, 5859, 5774, 140, 206, 128, 426] ./ 0.00082
    )
marginal_ind_income = DataFrame(
    income = [25394, 44855, 63969, 88026, 145915], 
    population = repeat([popoulation_size * SCALE / 5], 5)
    )
marginal_hh_size = DataFrame(
    hh_size = [1,2,3,4,5],
    population = Int.(round.(SCALE * 8230000 .* [0.299, 0.331, 0.217, 0.09, 0.063]))
    )

#areas
URL = "https://osm-boundaries.com/Download/Submit?apiKey=87100809b4085adb58139419c141e5a1&db=osm20230102&osmIds=-2988894,-2988933,-2988895,-288600,-2988896,-2988946,-5505984,-2988897,-2988898,-2988899,-2988900,-5505985,-2988901,-2988902,-568660,-2988903&format=GeoJSON&srid=4326"
areas = generate_areas_dataframe(URL)
aggregated_areas = copy(areas)
aggregated_areas.:population = Int.(round.(SCALE .* 10000 .* [56.8, 313.2, 201.9, 345.1, 34.6, 184.0, 132.4, 45.7, 52.8, 39.3, 44.1, 131.3, 199.4, 226.9, 110.6, 70.9]))
aggregated_areas = aggregated_areas[:, [:id, :population]]

#households and individuals
individuals, aggregated_individuals = generate_joint_distributions(marginal_ind_sex_maritialstatus, marginal_ind_age_sex, marginal_ind_income, config_file = "config_file.json")
households, aggregated_households = generate_joint_distributions(marginal_hh_size)

#Allocation
disaggregated_households, unassigned1 = assign_individuals_to_households(individuals, aggregated_individuals, households, aggregated_households, return_unassigned = true)
disaggregated_households, unassigned2 = assign_areas_to_households!(disaggregated_households, areas, aggregated_areas, return_unassigned = true)

nothing

Downloading file... 
File downloaded. Unzipping file...
File saved at /Users/marcinzurek/Desktop/Studia/Research/SyntheticPopulation/file.geojson


[ Info: Inconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [9520, 9502]
[ Info: Converged in 1 iterations.
[ Info: Inconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [9777, 9166]
[ Info: Converged in 1 iterations.
[ Info: Inconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [18668, 21890]
[ Info: Converged in 1 iterations.


Total_number of individuals: 21885
Total_number of households: 8230
Allocation started... 


Progress: 100%|█████████████████████████████████████████| Time: 0:00:06


Allocated 86.0% individuals.
Allocated 100.0% households.
Assigning coordinates to households...


Progress: 100%|█████████████████████████████████████████| Time: 0:00:26


## Input and output data format

### Naming convention:
- `dataframe` - unique combinations of attributes, with an `:id` column and one column per attribute.
- `aggregated_dataframe` - aggregated population with columns `:id` and `:population`.
- `disaggregated_dataframe` - disaggregated population.
- `unassigned_dataframe` - aggregated population with columns `:id` and `:unassigned_population`.
- `marginal_attr1_attr2` - flattened contingency tables with marginal distributions.

### Dataframes:

`marginal_attr1_attr2`
___________

Description:
- flattened multi-way contingency table with marginal distributions of the variables
- example: `marginal_ind_age_sex`

Columns:
- `:attr1` - 1st attribute of the contingency table
- `:attr2` - 2nd attribute of the contingency table
- `:attr3`, `:attr4` - optional attributes
- `:population` - total population or total number of elements for a given combination of attributes

Returned by:
- manual input based on different data sources (e.g. census)

Mutated by:
- N/A

Used as argument by:
- `generate_joint_distributions`
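
To illustrate the format, a flattened two-way table could be built as in the sketch below. The counts are made up, and `marginal_age_sex_toy` and `marginal_sex_status_toy` are hypothetical names, not the data used in this notebook. The sketch also compares the totals of the two tables, since the log above reports "Inconsistent target margins" together with margin totals that differ.

```julia
using DataFrames

# Toy counts (made up): one row per combination of attribute values.
marginal_age_sex_toy = DataFrame(
    sex = repeat(['M', 'F'], 2),         # M, F, M, F
    age = repeat([22, 27], inner = 2),   # 22, 22, 27, 27
    population = [30, 40, 50, 60],
)
marginal_sex_status_toy = DataFrame(
    sex = ['M', 'F'],
    status = ["Married", "Married"],
    population = [85, 100],
)

# Compare the grand totals of the two marginal tables.
totals = (sum(marginal_age_sex_toy.population), sum(marginal_sex_status_toy.population))
consistent = totals[1] == totals[2]
```

Here the totals are 180 and 185, so `consistent` is `false`; tables like these describe populations of slightly different sizes.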

`disaggregated_households`
_____

Description:
- The final DataFrame that represents the estimated population. Each row is one household with its characteristics, such as the assigned individuals and the coordinates of the household's location.

Columns:
- `:id` - ID of a household
- `:hh_attr_id` - `:id` from DataFrame `households` that references a certain household type
- `:individuals_allocated` - Boolean flag that appears in the output below; presumably `true` once individuals have been allocated to the household
- `:head_id` - `:id` from DataFrame `individuals` that references the head of the household
- `:partner_id` - `:id` from DataFrame `individuals` that references the partner of the selected head, `0` if not applicable
- `:child1_id` - `:id` from DataFrame `individuals` that references a child in the household, `0` if not applicable
- `:child2_id` - `:id` from DataFrame `individuals` that references a child in the household, `0` if not applicable
- `:child3_id` - `:id` from DataFrame `individuals` that references a child in the household, `0` if not applicable
- `:lat` - latitude of the household location, `0` if not assigned
- `:lon` - longitude of the household location, `0` if not assigned
- `:area_id` - `:id` from DataFrame `areas` that references the assigned area, `0` if not assigned

Returned by:
- `assign_individuals_to_households`

Mutated by:
- `assign_areas_to_households!`

Used as argument by:
- `assign_areas_to_households!`
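
Because `:head_id` and the other member columns only store references, the attributes of a household member have to be looked up in `individuals`. The sketch below shows one way this could be done with a join. The tables `individuals_toy` and `households_toy` and their values are made up for illustration.

```julia
using DataFrames

# Toy stand-ins for `individuals` and `disaggregated_households` (values are made up).
individuals_toy = DataFrame(id = [1, 2, 3], sex = ['M', 'F', 'M'], age = [22, 27, 42])
households_toy = DataFrame(id = [1, 2], head_id = [3, 2], partner_id = [0, 1])

# Rename the key and attributes so they describe the head of the household.
heads = rename(individuals_toy, :id => :head_id, :sex => :head_sex, :age => :head_age)

# Attach the head's attributes to each household; sort to get a stable row order.
with_heads = sort(innerjoin(households_toy, heads, on = :head_id), :id)
```

A reference equal to `0` (for example `partner_id` of the first toy household) has no match in `individuals_toy`, so a join on such a column would need `leftjoin` to keep those households.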

In [14]:
disaggregated_households

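| Row | id | hh_attr_id | individuals_allocated | head_id | partner_id | child1_id | child2_id | child3_id | lat | lon | area_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | Int64 | Int64 | Bool | Int64 | Int64 | Int64 | Int64 | Int64 | Float64 | Float64 | Float64 |
| 1 | 1 | 1 | true | 269 | 0 | 0 | 0 | 0 | 39.6711 | 116.858 | 6.0 |
| 2 | 2 | 1 | true | 349 | 0 | 0 | 0 | 0 | 39.9374 | 116.409 | 16.0 |
| 3 | 3 | 1 | true | 21 | 0 | 0 | 0 | 0 | 40.0 | 116.373 | 4.0 |
| 4 | 4 | 1 | true | 485 | 0 | 0 | 0 | 0 | 40.1486 | 116.096 | 14.0 |
| 5 | 5 | 1 | true | 425 | 0 | 0 | 0 | 0 | 40.2181 | 116.235 | 14.0 |
| 6 | 6 | 1 | true | 129 | 0 | 0 | 0 | 0 | 39.8842 | 116.487 | 4.0 |
| 7 | 7 | 1 | true | 473 | 0 | 0 | 0 | 0 | 40.1584 | 117.255 | 8.0 |
| 8 | 8 | 1 | true | 105 | 0 | 0 | 0 | 0 | 40.4196 | 116.49 | 11.0 |
| 9 | 9 | 1 | true | 489 | 0 | 0 | 0 | 0 | 40.1149 | 116.762 | 7.0 |
| 10 | 10 | 1 | true | 197 | 0 | 0 | 0 | 0 | 39.7814 | 116.412 | 13.0 |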


`individuals`
______

Description:
- Each row represents a unique combination of attributes of individuals

Columns:
- `:id` - ID that represents a unique combination of attributes of individuals
- different columns that represent the attributes of the individual, in our example these are:
    - `:maritialstatus`
    - `:age`
    - `:sex`
    - `:income`

Returned by:
- `generate_joint_distributions`

Mutated by:
- N/A

Used as argument by:
- `assign_individuals_to_households`

In [15]:
individuals

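| Row | id | maritialstatus | sex | age | income |
| --- | --- | --- | --- | --- | --- |
|  | Int64 | String? | Char? | Int64? | Int64? |
| 1 | 1 | Not_married | M | 22 | 25394 |
| 2 | 2 | Married | M | 22 | 25394 |
| 3 | 3 | Divorced | M | 22 | 25394 |
| 4 | 4 | Widowed | M | 22 | 25394 |
| 5 | 5 | Not_married | F | 22 | 25394 |
| 6 | 6 | Married | F | 22 | 25394 |
| 7 | 7 | Divorced | F | 22 | 25394 |
| 8 | 8 | Widowed | F | 22 | 25394 |
| 9 | 9 | Not_married | M | 27 | 25394 |
| 10 | 10 | Married | M | 27 | 25394 |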


`aggregated_individuals`
________

Description:
- Each row gives the number of individuals with the unique combination of attributes referenced by `:id`.

Columns:
- `:id` - `:id` from DataFrame `individuals` that references a unique combination of attributes of individuals
- `:population` - estimated number of individuals with the combination of attributes described by `:id`

Returned by:
- `generate_joint_distributions`

Mutated by:
- N/A

Used as argument by:
- `assign_individuals_to_households`
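
Since `aggregated_individuals` stores only `:id` and `:population`, it is usually read together with `individuals`. The sketch below joins two toy tables on `:id` and sums the counts per sex. The table names and values are hypothetical and much smaller than the real data.

```julia
using DataFrames

# Toy stand-ins for `individuals` and `aggregated_individuals` (values are made up).
individuals_toy = DataFrame(id = 1:4, sex = ['M', 'M', 'F', 'F'], age = [22, 27, 22, 27])
aggregated_toy = DataFrame(id = 1:4, population = [31, 56, 43, 70])

# Attach the counts to the attribute combinations, then aggregate by one attribute.
joined = innerjoin(individuals_toy, aggregated_toy, on = :id)
by_sex = combine(groupby(joined, :sex), :population => sum => :population)

# Collect the result in a Dict so it does not depend on the group order.
population_by_sex = Dict(by_sex.sex .=> by_sex.population)
```

In this toy example `population_by_sex['M']` is 87 and `population_by_sex['F']` is 113.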

In [16]:
aggregated_individuals

| Row | id | population |
| --- | --- | --- |
|  | Int64 | Int64 |
| 1 | 1 | 31 |
| 2 | 2 | 107 |
| 3 | 3 | 3 |
| 4 | 4 | 2 |
| 5 | 5 | 43 |
| 6 | 6 | 150 |
| 7 | 7 | 4 |
| 8 | 8 | 3 |
| 9 | 9 | 56 |
| 10 | 10 | 196 |


`areas`
______

Description:
- Each row represents a unique area and its characteristics

Columns:
- `:id` - ID of an area
- additional data representing each area. In our case:
    - `:geometry` - `MultiPolygon` or `Polygon` object that represents the territory of a given area
    - `:name` - name of the area

Returned by: 
- function `generate_areas_dataframe` given input data downloaded from https://osm-boundaries.com/
- Can also be generated manually, depending on the data source

Mutated by:
- N/A

Used as argument by:
- `assign_areas_to_households!`

In [17]:
areas

| Row | id | geometry | name |
| --- | --- | --- | --- |
|  | Int64 | Geometry | String |
| 1 | 1 | Polygon | Shijingshan District |
| 2 | 2 | Polygon | Haidian District |
| 3 | 3 | Polygon | Fengtai District |
| 4 | 4 | MultiPolygon | Chaoyang District |
| 5 | 5 | Polygon | Yanqing District |
| 6 | 6 | Polygon | Tongzhou District |
| 7 | 7 | Polygon | Shunyi District |
| 8 | 8 | Polygon | Pinggu District |
| 9 | 9 | Polygon | Miyun District |
| 10 | 10 | Polygon | Mentougou District |


`aggregated_areas`
________________

Description:
- each row represents population for a given area

Columns:
- `:id` - `:id` from DataFrame `areas` that references a unique area
- `:population` - population for a given area

Returned by:
- it is generated manually given DataFrame `areas` and additional input data (e.g. from census)

Mutated by:
- N/A

Used as argument by:
- `assign_areas_to_households!`

In [18]:
aggregated_areas

| Row | id | population |
| --- | --- | --- |
|  | Int64 | Int64 |
| 1 | 1 | 568 |
| 2 | 2 | 3132 |
| 3 | 3 | 2019 |
| 4 | 4 | 3451 |
| 5 | 5 | 346 |
| 6 | 6 | 1840 |
| 7 | 7 | 1324 |
| 8 | 8 | 457 |
| 9 | 9 | 528 |
| 10 | 10 | 393 |


`households`
_____________

Description:
- Each row represents a unique combination of household attributes

Columns:
- `:id` - ID that represents a unique combination of household attributes
- different columns that represent the attributes of the household, in our example these are:
    - `:hh_size`

Returned by:
- `generate_joint_distributions`

Mutated by:
- N/A

Used as argument by:
- `assign_individuals_to_households`

In [19]:
households

| Row | id | hh_size |
| --- | --- | --- |
|  | Int64 | Int64 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| 4 | 4 | 4 |
| 5 | 5 | 5 |


`aggregated_households`
___________________

Description:
- Each row gives the number of households with the unique combination of attributes referenced by `:id`.

Columns:
- `:id` - `:id` from DataFrame `households` that references a certain household type
- `:population` - number of households with combination of attributes described by `:id`

Returned by:
- `generate_joint_distributions`
- can also be generated manually (as in our case)

Mutated by:
- N/A

Used as argument by:
- `assign_individuals_to_households`
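
The counts in our example follow directly from the scaled household total and the household-size shares used in the data generation cell. The sketch below repeats that arithmetic in isolation; the names `total_households` and `shares` are introduced here only for illustration.

```julia
SCALE = 0.001
total_households = 8230000                     # household total used in the data generation cell
shares = [0.299, 0.331, 0.217, 0.09, 0.063]    # shares of household sizes 1 to 5

# Scale the total, split it by share, and round to whole households.
population = round.(Int, SCALE * total_households .* shares)
```

The result is `[2461, 2724, 1786, 741, 518]`, which sums to the 8230 households reported in the log.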

In [20]:
aggregated_households

| Row | id | population |
| --- | --- | --- |
|  | Int64 | Int64 |
| 1 | 1 | 2461 |
| 2 | 2 | 2724 |
| 3 | 3 | 1786 |
| 4 | 4 | 741 |
| 5 | 5 | 518 |


### Optional output dataframes

`unassigned_individuals`
___________

Description: 
- Mutation of `aggregated_individuals`. Represents individuals that were not assigned to any household.

Columns: 
- `:id` - `:id` from DataFrame `individuals` that represents a unique combination of attributes of individuals
- `:population` - number of individuals with the combination of attributes referenced by `:id` that were not assigned to any household

Returned by:
- `assign_individuals_to_households` if optional parameter `return_unassigned` is set to `true`

Mutated by:
- N/A

Used as argument by:
- N/A
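
A table of this shape can be used to compute the share of allocated individuals, similar to the percentage printed in the log. The sketch below uses made-up toy tables; whether the package derives its printed percentage in exactly this way is an assumption.

```julia
using DataFrames

# Toy stand-ins for `aggregated_individuals` and `unassigned_individuals` (values are made up).
aggregated_toy = DataFrame(id = [1, 2, 3], population = [50, 30, 20])
unassigned_toy = DataFrame(id = [2, 3], population = [5, 20])

total = sum(aggregated_toy.population)
not_assigned = sum(unassigned_toy.population)

# Percentage of individuals that ended up in some household.
allocated_share = round(100 * (1 - not_assigned / total), digits = 1)
```

With these toy values 25 of 100 individuals remain unassigned, so `allocated_share` is `75.0`.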

In [21]:
unassigned1["unassigned_individuals"]

| Row | id | population |
| --- | --- | --- |
|  | Int64 | Int64 |
| 1 | 2 | 3 |
| 2 | 5 | 1 |
| 3 | 6 | 3 |
| 4 | 9 | 21 |
| 5 | 10 | 8 |
| 6 | 11 | 3 |
| 7 | 12 | 2 |
| 8 | 13 | 23 |
| 9 | 14 | 3 |
| 10 | 15 | 2 |


`unassigned_households`
___________

Description: 
- Mutation of `aggregated_households`. Represents households to which no individuals were allocated.

Columns: 
- `:id` - `:id` from DataFrame `households` that represents a unique combination of household attributes
- `:population` - number of households with the combination of attributes referenced by `:id` that have no individuals assigned

Returned by:
- `assign_individuals_to_households` if optional parameter `return_unassigned` is set to `true`

Mutated by:
- N/A

Used as argument by:
- N/A

In [22]:
unassigned1["unassigned_households"] # in our case all households were allocated individuals, so the DataFrame is empty

| Row | id | population |
| --- | --- | --- |
|  | Int64 | Int64 |


`unassigned_areas`
____________

Description: 
- Mutation of `aggregated_areas`. Represents areas for which the estimated population does not match the target population.
- Examples: 
    - if the target population for an area is 100, but the total number of individuals in households in this area is 98, the `:population` value is 2
    - if the target population for an area is 100, but the total number of individuals in households in this area is 105, the `:population` value is -5
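
The two examples above amount to an element-wise difference between target and assigned population. A minimal sketch, with hypothetical variable names:

```julia
target_population = [100, 100]     # target population of two hypothetical areas
assigned_population = [98, 105]    # individuals in households placed in each area

# Positive values mean the area is under-filled, negative values mean it is over-filled.
difference = target_population .- assigned_population
```

This yields `[2, -5]`, matching the two cases described above.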

Columns: 
- `:id` - `:id` from DataFrame `areas` that represents the area for which household locations are estimated
- `:population` - difference between the target population of the area and the number of individuals in the households assigned to it (positive if the area is under-filled, negative if it is over-filled)

Returned by:
- `assign_areas_to_households!` if optional parameter `return_unassigned` is set to `true`

Mutated by:
- N/A

Used as argument by:
- N/A

In [23]:
unassigned2["unassigned_areas"] #in our case 14 people (target population) were not assigned 

| Row | id | geometry | name | population | target_population | max_lon | min_lon | max_lat | min_lat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | Int64 | Geometry | String | Int64 | Int64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 4 | MultiPolygon | Chaoyang District | 3451 | 19 | 116.639 | 116.345 | 40.1101 | 39.8083 |


`disaggregated_unassigned_households`
________

Description:
- Represents households to which coordinates were not assigned.
- Filtered rows of `disaggregated_households` for which `:lat` and `:lon` are equal to 0 after `assign_areas_to_households!` has run.

Columns:
- same as in `disaggregated_households`

Returned by:
- `assign_areas_to_households!` if optional parameter `return_unassigned` is set to `true`

Mutated by:
- N/A

Used as argument by:
- N/A
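
The filtering described above can also be reproduced by hand on `disaggregated_households`. The sketch below applies it to a made-up toy table; `disaggregated_toy` and its values are hypothetical.

```julia
using DataFrames

# Toy stand-in for `disaggregated_households` (values are made up).
disaggregated_toy = DataFrame(
    id = [1, 2, 3],
    lat = [39.67, 0.0, 40.15],
    lon = [116.86, 0.0, 116.10],
)

# Keep only households whose coordinates are still at the default value 0.
without_coordinates = filter(row -> row.lat == 0 && row.lon == 0, disaggregated_toy)
```

Only the household with `id` 2 remains in `without_coordinates`.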

In [24]:
unassigned2["disaggregated_unassigned_households"] #in our case all households were assigned so dataframe is empty

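| Row | id | hh_attr_id | hh_size | individuals_allocated | head_id | partner_id | child1_id | child2_id | child3_id | lat | lon | area_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | Int64 | Int64 | Int64 | Bool | Int64 | Int64 | Int64 | Int64 | Int64 | Float64 | Float64 | Int64 |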
