In [1]:
import pandas as pd
from ipfn import ipfn

## Geographical aggregates

As an example, we load two data sets containing age and gender counts per region. The data is taken from this book (free online, written for R users): 
https://spatial-microsim-book.robinlovelace.net/what-is.html

In `SimpleWorld`, we have **33 individuals** and we know:

- Distribution per age and region
- Distribution per gender and region

Summing these tables over one dimension also gives us:
- Distribution per age
- Distribution per gender
- Distribution per region

We are interested in calculating:

- Distribution per (age, gender, region). 

We estimate it with iterative proportional fitting (IPF), using the `ipfn` package.
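To give a feel for what the fitting step does, here is a minimal two-dimensional sketch of iterative proportional fitting in plain NumPy. It is not the `ipfn` implementation; the function name `ipf_2d` and its parameters are made up for illustration, and only region and age totals are used.

```python
import numpy as np

def ipf_2d(seed, row_totals, col_totals, max_iter=100, tol=1e-9):
    """Scale `seed` until its row and column sums match the targets."""
    table = np.asarray(seed, dtype=float).copy()
    row_totals = np.asarray(row_totals, dtype=float)
    col_totals = np.asarray(col_totals, dtype=float)
    for _ in range(max_iter):
        # Rescale every row so that the row sums match the row targets.
        table *= (row_totals / table.sum(axis=1))[:, None]
        # Rescale every column so that the column sums match the column targets.
        table *= (col_totals / table.sum(axis=0))[None, :]
        # Column sums match exactly after the last step; stop once rows do too.
        if np.abs(table.sum(axis=1) - row_totals).max() < tol:
            break
    return table

# SimpleWorld region totals (12, 10, 11) and age totals (17 under 50, 16 over 50).
fitted = ipf_2d(np.ones((3, 2)), [12, 10, 11], [17, 16])
print(fitted.sum(axis=1), fitted.sum(axis=0))
```

With a uniform seed and only one-dimensional targets, the result is simply the table in which age and region are independent. The two-dimensional marginals used later in this notebook are what add real structure.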

### 1. Read the data and turn it into "long" format.

In [2]:
age = pd.read_csv("./data/SimpleWorld/age.csv")
gender = pd.read_csv("./data/SimpleWorld/sex.csv")

In [3]:
age['region'] = age.index
gender['region'] = gender.index

In [4]:
age

Unnamed: 0,a.50-,a.50+,region
0,8,4,0
1,2,8,1
2,7,4,2


In [5]:
age = pd.melt(age, id_vars=['region'], value_vars=['a.50-', 'a.50+'], value_name='total', var_name='age')

In [6]:
gender = pd.melt(gender, id_vars=['region'], value_vars=['m', 'f'], value_name='total', var_name='gender')

In [7]:
age

Unnamed: 0,region,age,total
0,0,a.50-,8
1,1,a.50-,2
2,2,a.50-,7
3,0,a.50+,4
4,1,a.50+,8
5,2,a.50+,4
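The wide-to-long reshaping can be tried without the CSV files. The sketch below types in the age counts shown in the output of `In [4]` and pivots the long table back as a sanity check; the names `wide`, `long_age` and `round_trip` are only for illustration.

```python
import pandas as pd

# Wide table copied from the output of In [4]: one row per region.
wide = pd.DataFrame({'a.50-': [8, 2, 7], 'a.50+': [4, 8, 4], 'region': [0, 1, 2]})

# Long format: one row per (region, age) pair, with the count in 'total'.
long_age = pd.melt(wide, id_vars=['region'], value_vars=['a.50-', 'a.50+'],
                   var_name='age', value_name='total')

# Pivoting the long table back should reproduce the wide counts.
round_trip = long_age.pivot(index='region', columns='age', values='total')
print(long_age)
```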


### 2. Calculate marginals

In [8]:
age_marginal = age.groupby('age')['total'].sum()
gender_marginal = gender.groupby('gender')['total'].sum()
region_marginal = age.groupby('region')['total'].sum()
region_age_marginal = age.groupby(['age','region'])['total'].sum()
region_gender_marginal = gender.groupby(['region', 'gender'])['total'].sum()

In [9]:
age_marginal

age
a.50+    16
a.50-    17
Name: total, dtype: int64

In [10]:
aggregates = [age_marginal, region_marginal, gender_marginal,  region_age_marginal, region_gender_marginal]
dimensions = [['age'], ['region'], ['gender'],  ['age','region'], ['region', 'gender']]

In [11]:
dimensions

[['age'], ['region'], ['gender'], ['age', 'region'], ['region', 'gender']]
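The targets can only all be matched if they agree on the population size, so it may be worth checking that before fitting. The sketch below uses a hypothetical helper `marginal_totals`. It is shown on the age table only, because the gender counts are not printed in this notebook.

```python
import pandas as pd

def marginal_totals(aggregates):
    """Return the grand total implied by each marginal."""
    return [float(m.sum()) for m in aggregates]

# Age counts in long format, copied from the output of In [7].
age_long = pd.DataFrame({
    'region': [0, 1, 2, 0, 1, 2],
    'age': ['a.50-'] * 3 + ['a.50+'] * 3,
    'total': [8, 2, 7, 4, 8, 4],
})
example_aggregates = [
    age_long.groupby('age')['total'].sum(),
    age_long.groupby('region')['total'].sum(),
    age_long.groupby(['age', 'region'])['total'].sum(),
]
totals = marginal_totals(example_aggregates)
consistent = len(set(totals)) == 1  # True when every marginal implies the same total
print(totals, consistent)
```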

### 3. Create a "placeholder" dataframe that will have the desired structure.

In [12]:
rows = []
for r in age['region'].unique():
    for a in age['age'].unique():
        for g in gender['gender'].unique():
            # One row per (age, gender, region) cell, seeded with weight 1
            rows.append(pd.DataFrame({'age': [a], 'gender': [g], 'region': [r], 'total': [1]}))
# DataFrame.append was removed in pandas 2.0; concatenate the rows instead
res = pd.concat(rows)

In [13]:
res

Unnamed: 0,age,gender,region,total
0,a.50-,m,0,1
0,a.50-,f,0,1
0,a.50+,m,0,1
0,a.50+,f,0,1
0,a.50-,m,1,1
0,a.50-,f,1,1
0,a.50+,m,1,1
0,a.50+,f,1,1
0,a.50-,m,2,1
0,a.50-,f,2,1
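The same placeholder can also be built without explicit loops. The following is a sketch, assuming the category labels seen in this notebook (`a.50-`/`a.50+`, `m`/`f`, regions 0 to 2); the variable name `seed` is illustrative.

```python
import pandas as pd

ages = ['a.50-', 'a.50+']
genders = ['m', 'f']
regions = [0, 1, 2]

# Cartesian product of all categories, in the same order as the nested loops.
index = pd.MultiIndex.from_product([regions, ages, genders],
                                   names=['region', 'age', 'gender'])
seed = index.to_frame(index=False)
seed['total'] = 1.0  # uniform starting weight for every cell
print(seed)
```

Unlike the row-by-row version, this yields a clean 0 to 11 index.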


In [14]:
IPF = ipfn.ipfn(res, aggregates, dimensions)

In [15]:
res = IPF.iteration()

ipfn converged: convergence_rate not updating or below rate_tolerance


In [16]:
res

Unnamed: 0,region,gender,age,total
0,0,f,a.50-,2.0
1,0,f,a.50+,4.0
2,0,m,a.50-,2.0
3,0,m,a.50+,4.0
4,1,f,a.50-,4.8
5,1,f,a.50+,1.2
6,1,m,a.50-,3.2
7,1,m,a.50+,0.8
8,2,f,a.50-,2.90909
9,2,f,a.50+,5.09091


In [17]:
res['total'].sum() # Should equal 33, the size of the population

33.0
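The grand total is only one of the targets. A stricter check compares every marginal of a weighted table with its target. The sketch below uses a hypothetical helper `max_marginal_gap`. It is demonstrated on the age counts from earlier in the notebook rather than on the fitted result.

```python
import pandas as pd

def max_marginal_gap(table, aggregates, dimensions, weight_col='total'):
    """Largest absolute difference between table marginals and their targets."""
    gaps = []
    for target, dims in zip(aggregates, dimensions):
        fitted = table.groupby(dims)[weight_col].sum()
        # Align on the index so that label order does not matter.
        gaps.append((fitted - target.reindex(fitted.index)).abs().max())
    return max(gaps)

age_long = pd.DataFrame({
    'region': [0, 1, 2, 0, 1, 2],
    'age': ['a.50-'] * 3 + ['a.50+'] * 3,
    'total': [8, 2, 7, 4, 8, 4],
})
dims = [['age'], ['region']]
targets = [age_long.groupby(d)['total'].sum() for d in dims]

gap_ok = max_marginal_gap(age_long, targets, dims)    # table vs. its own marginals
halved = age_long.assign(total=age_long['total'] / 2)
gap_bad = max_marginal_gap(halved, targets, dims)     # deliberately mis-scaled table
print(gap_ok, gap_bad)
```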

### 4. Estimating income per region

Using the results of a sample (say, a survey) together with the synthetic population generated above, we can estimate the total income per region.

In [18]:
survey = pd.read_csv('./data/SimpleWorld/ind-full.csv')
survey

Unnamed: 0,id,age,gender,region,income
0,1,a.50+,m,0,2868
1,2,a.50+,m,1,2474
2,3,a.50-,m,2,2231
3,4,a.50+,f,0,3152
4,5,a.50-,f,1,2473


In [19]:
out = pd.merge(
    survey, res, 
    how='left', 
    on=['age', 'region', 'gender'], 
)

In [20]:
out['total_income'] = out['total']*out['income']

In [21]:
out.groupby('region')['total_income'].sum()

region
0    24080.000000
1    13849.600000
2     2433.818182
Name: total_income, dtype: float64
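The weighting can be reproduced by hand for regions 0 and 1. The sketch below uses only the survey rows and fitted weights printed above and leaves region 2 out, because its weights are not fully shown. The names `weights` and `respondents` are illustrative. `validate='many_to_one'` is an optional safeguard that raises an error if a cell appears more than once in the weight table.

```python
import pandas as pd

# Fitted cell weights for regions 0 and 1, copied from the output of In [16].
weights = pd.DataFrame({
    'region': [0, 0, 0, 0, 1, 1, 1, 1],
    'gender': ['f', 'f', 'm', 'm', 'f', 'f', 'm', 'm'],
    'age': ['a.50-', 'a.50+', 'a.50-', 'a.50+'] * 2,
    'total': [2.0, 4.0, 2.0, 4.0, 4.8, 1.2, 3.2, 0.8],
})

# Survey respondents living in regions 0 and 1, copied from the output of In [18].
respondents = pd.DataFrame({
    'id': [1, 2, 4, 5],
    'age': ['a.50+', 'a.50+', 'a.50+', 'a.50-'],
    'gender': ['m', 'm', 'f', 'f'],
    'region': [0, 1, 0, 1],
    'income': [2868, 2474, 3152, 2473],
})

merged = pd.merge(respondents, weights, how='left',
                  on=['age', 'region', 'gender'], validate='many_to_one')
# Each respondent stands in for `total` people in the same cell.
merged['total_income'] = merged['total'] * merged['income']
by_region = merged.groupby('region')['total_income'].sum()
print(by_region)
```

The two region totals agree with the values printed above (24080 and about 13849.6).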