## Data generation

#### Include needed files
The data and functions are not explained here; a detailed explanation is provided in `demo.ipynb`. This notebook only executes the code so that the resulting tables can be described in the sections below.

In [6]:
include("../src/SyntheticPopulation.jl")

SCALE = 0.001 
popoulation_size = 21890000

marginal_ind_age_sex = DataFrame(
    sex = repeat(['M', 'F'], 18),
    age = repeat(2:5:87, inner = 2), 
    population = SCALE .* 10000 .* [52.6, 49.0, 48.5, 44.8, 33.6, 30.6, 34.6, 28.8, 71.6, 63.4, 99.6, 90.9, 130.9, 119.4, 110.8, 103.5, 83.8, 76.4, 84.2, 77.7, 84.2, 77.8, 82.8, 79.9, 67.7, 71.0, 56.9, 62.6, 31.5, 35.3, 18.5, 23.0, 15.2, 19.7, 12.5, 16.0]
    )
marginal_ind_sex_maritalstatus = DataFrame(
    sex = repeat(['M', 'F'], 4), 
    maritalstatus = repeat(["Not_married", "Married", "Divorced", "Widowed"], inner = 2), 
    population = SCALE .* [1679, 1611, 5859, 5774, 140, 206, 128, 426] ./ 0.00082
    )
marginal_ind_income = DataFrame(
    income = [25394, 44855, 63969, 88026, 145915], 
    population = repeat([popoulation_size * SCALE / 5], 5)
    )
marginal_hh_size = DataFrame(
    hh_size = [1,2,3,4,5],
    population = Int.(round.(SCALE * 8230000 .* [0.299, 0.331, 0.217, 0.09, 0.063]))
    )

#areas
URL = "https://osm-boundaries.com/Download/Submit?apiKey=87100809b4085adb58139419c141e5a1&db=osm20230102&osmIds=-2988894,-2988933,-2988895,-288600,-2988896,-2988946,-5505984,-2988897,-2988898,-2988899,-2988900,-5505985,-2988901,-2988902,-568660,-2988903&format=GeoJSON&srid=4326"
areas = generate_areas_dataframe(URL)
aggregated_areas = copy(areas)
aggregated_areas.:population = Int.(round.(SCALE .* 10000 .* [56.8, 313.2, 201.9, 345.1, 34.6, 184.0, 132.4, 45.7, 52.8, 39.3, 44.1, 131.3, 199.4, 226.9, 110.6, 70.9]))

#households and individuals
aggregated_individuals = generate_joint_distributions(marginal_ind_sex_maritalstatus, marginal_ind_age_sex, marginal_ind_income, config_file = "config_file.json")
aggregated_households = generate_joint_distributions(marginal_hh_size)

#Allocation
disaggregated_households, unassigned1 = assign_individuals_to_households(aggregated_individuals, aggregated_households, return_unassigned = true)
disaggregated_households, unassigned2 = assign_areas_to_households!(disaggregated_households, aggregated_areas, return_unassigned = true)

nothing

Downloading file... 
File downloaded. Unzipping file...
File saved at /Users/marcinzurek/Desktop/Studia/Research/SyntheticPopulation/notebooks/file.geojson


[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mInconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [9777, 10698]
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 1 iterations.
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mInconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [9520, 11195]
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 1 iterations.
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mInconsistent target margins, converting `X` and `mar` to proportions. Margin totals: [21893, 21890]
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mConverged in 1 iterations.


aggregated_joint_distribution = 568×5 DataFrame
 Row │ maritalstatus  sex    age     income   population
     │ String?         Char?  Int64?  Int64?   Int64
─────┼────────────────────────────────────────────────────
   1 │ missing         F           2  missing         495
   2 │ missing         M           2  missing         530
   3 │ missing         F           7  missing         450
   4 │ missing         M           7  missing         490
   5 │ missing         F          12  missing         305
   6 │ missing         M          12  missing         330
   7 │ missing         F          17  missing         285
   8 │ missing         M          17  missing         345
   9 │ Divorced        F          22    25394           3
  10 │ Married         F          22    25394          91
  11 │ Not_married     F          22    25394          25
  12 │ Widowed         F          22    25394           7
  13 │ Divorced        M          22    25394           3
  14 │ Married         M     

 568 │ Widowed         M          87   145915           0
Total number of individuals: 21890
Total number of households: 8230
Allocation started... 


Progress: 100%|█████████████████████████████████████████| Time: 0:00:05



---------------
There are no available children! 
---------------
Allocated 82.0% individuals.
Allocated 95.0% households.
Assigning coordinates to households...


Progress: 100%|█████████████████████████████████████████| Time: 0:00:25


## Input and output data format

### Naming convention:
- `aggregated_dataframe` - unique combinations of attribute values, with column `:id`, one column per attribute, and column `:population`
- `disaggregated_dataframe` - disaggregated population, one row per unit (e.g. one household per row)
- `unassigned_dataframe` - aggregated population that could not be allocated, with column `:id`, one column per attribute, and column `:population`
- `marginal_attr1_attr2` - flattened contingency tables with marginal distributions

### Dataframes:

### `marginal_attr1_attr2`
___________

Description:
- flattened multi-way contingency table with marginal distributions of the variables
- example: `marginal_ind_age_sex`

Columns:
- `:attr1` - 1st attribute of the contingency table
- `:attr2` - 2nd attribute of the contingency table
- `:attr3`, `:attr4` - optional attributes
- `:population` - total population or total number of elements for a given combination of attributes

Returned by:
- manual input based on different data sources (e.g. census)

Mutated by:
- N/A

Used as argument by:
- `generate_joint_distributions`
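
As a minimal sketch of the flattened layout, the snippet below rebuilds the first six rows of `marginal_ind_age_sex` from the values in the data-generation cell. It uses plain Base Julia vectors instead of a `DataFrame` (an assumption made so that it runs without any packages); the names `counts` and `rows` are illustrative only.

```julia
# Scale factor and the first six age/sex counts from the data-generation cell.
SCALE = 0.001
counts = [52.6, 49.0, 48.5, 44.8, 33.6, 30.6]   # in units of 10,000 people

# `repeat(v, n)` cycles the whole vector; `inner = n` repeats each element in place.
sex = repeat(['M', 'F'], 3)                # M, F, M, F, M, F
age = repeat(2:5:12, inner = 2)            # 2, 2, 7, 7, 12, 12
population = round.(Int, SCALE .* 10000 .* counts)

# One row per (sex, age) combination: the "flattened" contingency table.
rows = [(sex = s, age = a, population = p) for (s, a, p) in zip(sex, age, population)]

for r in rows
    println(r.sex, '\t', r.age, '\t', r.population)
end
```

The two `repeat` patterns must line up so that every combination of attribute values appears exactly once. A vector of named tuples like `rows` is a row table, so `DataFrame(rows)` should turn it into the tabular form used in this notebook.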

In [7]:
marginal_ind_age_sex

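| Row | sex | age | population |
|-----|-----|-----|------------|
|     | Char | Int64 | Int64 |
| 1 | F | 2 | 490 |
| 2 | M | 2 | 526 |
| 3 | F | 7 | 448 |
| 4 | M | 7 | 485 |
| 5 | F | 12 | 306 |
| 6 | M | 12 | 336 |
| 7 | F | 17 | 288 |
| 8 | M | 17 | 346 |
| 9 | F | 22 | 634 |
| 10 | M | 22 | 716 |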


### `disaggregated_households`
_____

Description:
- The final DataFrame that represents the estimated population. Each row is one household with its characteristics, such as the assigned individuals and the coordinates of the household's location.

Columns:
- `:id` - ID of a household
- `:hh_attr_id` - `:id` from DataFrame `aggregated_households` that references a certain household type.
- `:individuals_allocated` – a Boolean flag set to `true` if the individuals were allocated to the household
- `:head_id` - `:id` from DataFrame `aggregated_individuals` that references head of household
- `:partner_id` - `:id` from DataFrame `aggregated_individuals` that references the partner of the selected head, `0` if not applicable
- `:child1_id` - `:id` from DataFrame `aggregated_individuals` that references a child in the household, `0` if not applicable
- `:child2_id` - `:id` from DataFrame `aggregated_individuals` that references a child in the household, `0` if not applicable
- `:child3_id` - `:id` from DataFrame `aggregated_individuals` that references a child in the household, `0` if not applicable
- `:lat` - latitude of the household location, `0` if not assigned
- `:lon` - longitude of the household location, `0` if not assigned
- `:area_id` - `:id` of the area in DataFrame `aggregated_areas`, `0` if not assigned

Returned by:
- `assign_individuals_to_households`

Mutated by:
- `assign_areas_to_households!`

Used as argument by:
- `assign_areas_to_households!`
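
The `0` placeholders in the member columns make it straightforward to recover who belongs to a household. The sketch below uses Base Julia named tuples as stand-ins for `DataFrame` rows; the helper `member_ids` and the sample rows are hypothetical and not part of the package.

```julia
# Hypothetical rows that mimic the documented member columns of `disaggregated_households`.
households = [
    (id = 1, head_id = 531, partner_id = 0,  child1_id = 0, child2_id = 0, child3_id = 0),
    (id = 2, head_id = 120, partner_id = 98, child1_id = 3, child2_id = 0, child3_id = 0),
]

# Collect the non-zero references to `aggregated_individuals.id` for one household.
function member_ids(hh)
    ids = (hh.head_id, hh.partner_id, hh.child1_id, hh.child2_id, hh.child3_id)
    return [i for i in ids if i != 0]
end

for hh in households
    println("household ", hh.id, ": ", length(member_ids(hh)), " member(s) -> ", member_ids(hh))
end
```

Each ID points to a row of `aggregated_individuals`, that is, to a combination of attributes rather than a unique person, so several households can share the same `:head_id`.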

In [53]:
disaggregated_households

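| Row | id | hh_attr_id | individuals_allocated | head_id | partner_id | child1_id | child2_id | child3_id | lat | lon | area_id |
|-----|----|------------|-----------------------|---------|------------|-----------|-----------|-----------|-----|-----|---------|
|     | Int64 | Int64 | Bool | Int64 | Int64 | Int64 | Int64 | Int64 | Float64 | Float64 | Int64 |
| 1 | 1 | 1 | true | 531 | 0 | 0 | 0 | 0 | 40.0466 | 116.538 | 4 |
| 2 | 2 | 1 | true | 19 | 0 | 0 | 0 | 0 | 39.9333 | 115.797 | 10 |
| 3 | 3 | 1 | true | 223 | 0 | 0 | 0 | 0 | 39.8258 | 116.2 | 3 |
| 4 | 4 | 1 | true | 419 | 0 | 0 | 0 | 0 | 40.3675 | 117.164 | 8 |
| 5 | 5 | 1 | true | 217 | 0 | 0 | 0 | 0 | 39.9353 | 116.552 | 4 |
| 6 | 6 | 1 | true | 71 | 0 | 0 | 0 | 0 | 39.9341 | 116.304 | 2 |
| 7 | 7 | 1 | true | 131 | 0 | 0 | 0 | 0 | 40.2924 | 116.068 | 14 |
| 8 | 8 | 1 | true | 504 | 0 | 0 | 0 | 0 | 39.8775 | 115.817 | 12 |
| 9 | 9 | 1 | true | 243 | 0 | 0 | 0 | 0 | 39.9791 | 116.279 | 2 |
| 10 | 10 | 1 | true | 243 | 0 | 0 | 0 | 0 | 40.111 | 116.091 | 2 |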


### `aggregated_individuals`
______

Description:
- Each row represents a unique combination of attributes of individuals

Columns:
- `:id` - ID that represents a unique combination of attributes of individuals
- different columns that represent the attributes of the individual, in our example these are:
    - `:maritalstatus`
    - `:age`
    - `:sex`
    - `:income`
- `:population` - estimated number of individuals with the combination of attributes described by `:id`


Returned by:
- `generate_joint_distributions`

Mutated by:
- N/A

Used as argument by:
- `assign_individuals_to_households`

In [15]:
aggregated_individuals

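| Row | id | maritalstatus | sex | age | income | population |
|-----|----|---------------|-----|-----|--------|------------|
|     | Int64 | String? | Char? | Int64? | Int64? | Int64 |
| 1 | 1 | missing | F | 2 | missing | 495 |
| 2 | 2 | missing | M | 2 | missing | 530 |
| 3 | 3 | missing | F | 7 | missing | 450 |
| 4 | 4 | missing | M | 7 | missing | 490 |
| 5 | 5 | missing | F | 12 | missing | 305 |
| 6 | 6 | missing | M | 12 | missing | 330 |
| 7 | 7 | missing | F | 17 | missing | 285 |
| 8 | 8 | missing | M | 17 | missing | 345 |
| 9 | 9 | Divorced | F | 22 | 25394 | 3 |
| 10 | 10 | Married | F | 22 | 25394 | 91 |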


### `aggregated_areas`
______

Description:
- Each row represents a unique area and its characteristics

Columns:
- `:id` - ID of an area
- additional data representing each area. In our case:
    - `:geometry` - `MultiPolygon` or `Polygon` object that represents the territory of a given area
    - `:name` - name of the area
- `:population` - population for a given area


Returned by: 
- function `generate_areas_dataframe` given input data downloaded from https://osm-boundaries.com/ to obtain all columns except for `:population`
- `:population` column to be provided manually given available data
- Can be also generated manually depending on data source

Mutated by:
- N/A

Used as argument by:
- `assign_areas_to_households!`

In [16]:
aggregated_areas

| Row | id | geometry | name | population |
|-----|----|----------|------|------------|
|     | Int64 | Geometry | String | Int64 |
| 1 | 1 | Polygon | Shijingshan District | 568 |
| 2 | 2 | Polygon | Haidian District | 3132 |
| 3 | 3 | Polygon | Fengtai District | 2019 |
| 4 | 4 | MultiPolygon | Chaoyang District | 3451 |
| 5 | 5 | Polygon | Yanqing District | 346 |
| 6 | 6 | Polygon | Tongzhou District | 1840 |
| 7 | 7 | Polygon | Shunyi District | 1324 |
| 8 | 8 | Polygon | Pinggu District | 457 |
| 9 | 9 | Polygon | Miyun District | 528 |
| 10 | 10 | Polygon | Mentougou District | 393 |


### `aggregated_households`
_____________

Description:
- Each row represents a unique combination of attributes of household

Columns:
- `:id` - ID that represents a unique combination of attributes of households
- different columns that represent the attributes of the household, in our example these are:
    - `:hh_size`
- `:population` - number of households with combination of attributes described by `:id`

Returned by:
- `generate_joint_distributions`
- can be also generated manually

Mutated by:
- N/A

Used as argument by:
- `assign_individuals_to_households`

In [17]:
aggregated_households

| Row | id | hh_size | population |
|-----|----|---------|------------|
|     | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 2461 |
| 2 | 2 | 2 | 2724 |
| 3 | 3 | 3 | 1786 |
| 4 | 4 | 4 | 741 |
| 5 | 5 | 5 | 518 |


### Optional output dataframes

### `unassigned_individuals`
___________

Description: 
- Mutation of `aggregated_individuals`. Represents individuals that were not assigned to any household.

Columns: 
- `:id` - `:id` that represents a unique combination of attributes of individuals
- different columns that represent the attributes of the individual, in our example these are:
    - `:maritalstatus`
    - `:age`
    - `:sex`
    - `:income`
- `:population` - number of individuals with combination of attributes with `:id` that were not assigned to any household

Returned by:
- `assign_individuals_to_households` if optional parameter `return_unassigned` is set to `true`

Mutated by:
- N/A

Used as argument by:
- N/A

In [18]:
unassigned1["unassigned_individuals"]

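| Row | id | maritalstatus | sex | age | income | population |
|-----|----|---------------|-----|-----|--------|------------|
|     | Int64 | String? | Char? | Int64? | Int64? | Int64 |
| 1 | 17 | Divorced | F | 27 | 25394 | 1 |
| 2 | 18 | Married | F | 27 | 25394 | 9 |
| 3 | 19 | Not_married | F | 27 | 25394 | 20 |
| 4 | 20 | Widowed | F | 27 | 25394 | 6 |
| 5 | 21 | Divorced | M | 27 | 25394 | 2 |
| 6 | 22 | Married | M | 27 | 25394 | 19 |
| 7 | 23 | Not_married | M | 27 | 25394 | 18 |
| 8 | 24 | Widowed | M | 27 | 25394 | 2 |
| 9 | 25 | Divorced | F | 32 | 25394 | 1 |
| 10 | 26 | Married | F | 32 | 25394 | 14 |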


### `unassigned_households`
___________

Description: 
- Mutation of `aggregated_households`. Represents households to which no individuals were allocated.

Columns: 
- `:id` - `:id` from DataFrame `aggregated_households` that represents a unique combination of attributes of a household
- different columns that represent the attributes of the household, in our example these are:
    - `:hh_size`
- `:population` - number of households with combination of attributes represented by `:id` with no individuals assigned

Returned by:
- `assign_individuals_to_households` if optional parameter `return_unassigned` is set to `true`

Mutated by:
- N/A

Used as argument by:
- N/A

In [19]:
unassigned1["unassigned_households"] # number of households of each type that received no individuals

| Row | id | hh_size | population |
|-----|----|---------|------------|
|     | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 104 |
| 2 | 2 | 2 | 116 |
| 3 | 3 | 3 | 81 |
| 4 | 4 | 4 | 45 |
| 5 | 5 | 5 | 25 |
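
The unassigned counts can be turned into allocation rates by comparing them with the totals in `aggregated_households`. The sketch below does this in Base Julia with the household counts printed in this notebook; the variable names are illustrative, and plain vectors stand in for the `:population` columns.

```julia
# Household counts per `hh_size` as printed in this notebook.
hh_size    = [1, 2, 3, 4, 5]
total      = [2461, 2724, 1786, 741, 518]   # aggregated_households.population
unassigned = [104, 116, 81, 45, 25]         # unassigned_households.population

# Share of households of each size that received individuals.
allocated_share = 1 .- unassigned ./ total
overall_share = 1 - sum(unassigned) / sum(total)

for (s, share) in zip(hh_size, allocated_share)
    println("hh_size = ", s, ": ", round(100 * share, digits = 1), "% allocated")
end
println("overall: ", round(100 * overall_share, digits = 1), "% allocated")
```

A breakdown by household type like this shows which household sizes are hardest to fill, which the single summary percentage in the allocation log does not reveal.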


### `unassigned_areas`
____________

Description: 
- Mutation of `aggregated_areas`. Represents areas for which some population slots were not filled.

Columns: 
- `:id` - area ID
- additional columns carried over from `aggregated_areas`, in our case `:geometry` and `:name`
- `:population` - number of population slots in the area that were not filled

Returned by:
- `assign_areas_to_households!` if optional parameter `return_unassigned` is set to `true`

Mutated by:
- N/A

Used as argument by:
- N/A

In [20]:
unassigned2["unassigned_areas"] # remaining unfilled population (target population) per area

| Row | id | geometry | name | population |
|-----|----|----------|------|------------|
|     | Int64 | Geometry | String | Int64 |
| 1 | 1 | Polygon | Shijingshan District | 92 |
| 2 | 2 | Polygon | Haidian District | 600 |
| 3 | 3 | Polygon | Fengtai District | 317 |
| 4 | 4 | MultiPolygon | Chaoyang District | 673 |
| 5 | 5 | Polygon | Yanqing District | 64 |
| 6 | 6 | Polygon | Tongzhou District | 325 |
| 7 | 7 | Polygon | Shunyi District | 243 |
| 8 | 8 | Polygon | Pinggu District | 77 |
| 9 | 9 | Polygon | Miyun District | 105 |
| 10 | 10 | Polygon | Mentougou District | 65 |


### `disaggregated_unassigned_households`
________

Description:
- Represents households to which coordinates were not assigned.
- Filtered rows of `disaggregated_households` for which `:lat` and `:lon` are equal to 0 after `assign_areas_to_households!` has run.

Columns:
- same as in `disaggregated_households`

Returned by:
- `assign_areas_to_households!` if optional parameter `return_unassigned` is set to `true`

Mutated by:
- N/A

Used as argument by:
- N/A
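
The filtering rule described above can be sketched in a few lines. The example uses Base Julia named tuples with made-up rows in place of the real `disaggregated_households` DataFrame, and the predicate `is_unassigned` is a hypothetical helper, not a function of the package.

```julia
# Hypothetical rows with the location columns of `disaggregated_households`.
households = [
    (id = 1, lat = 40.0466, lon = 116.538, area_id = 4),
    (id = 2, lat = 0.0,     lon = 0.0,     area_id = 0),
    (id = 3, lat = 39.8258, lon = 116.2,   area_id = 3),
]

# A household counts as unassigned when both coordinates still hold the 0 placeholder.
is_unassigned(hh) = hh.lat == 0 && hh.lon == 0

without_location = filter(is_unassigned, households)
println([hh.id for hh in without_location])
```
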

In [21]:
unassigned2["disaggregated_unassigned_households"] #in our case all households were assigned so dataframe is empty

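| Row | id | hh_attr_id | hh_size | individuals_allocated | head_id | partner_id | child-1_id | child0_id | child1_id | child2_id | child3_id | lat | lon | area_id |
|-----|----|------------|---------|-----------------------|---------|------------|------------|-----------|-----------|-----------|-----------|-----|-----|---------|
|     | Int64 | Int64 | Int64 | Bool | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Float64 | Float64 | Int64 |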
