# Generating Synthetic World Population using Walker's Alias method

In this notebook, I generate a synthetic world population from per-country population statistics for 2019, using Walker's Alias method.<br><br>
First, I amend an existing implementation of Walker's Alias method (originally written for Python 2), then generate a synthetic world population, and finally compare it with the real one.



## Walker's Alias method implementation

Code for the Walkerrandom class amended from: <br>
https://code.activestate.com/recipes/576564-walkers-alias-method-for-random-objects-with-diffe/ <br>


In [1]:
from __future__ import division
import random

__author__ = "Maciej Tarsa, Denis Bzowy"
__version__ = "22may2021"

In [2]:
class Walkerrandom:
  """ Walker's alias method for random objects with different probabilities
  """

  def __init__( self, weights, keys=None ):
    """ builds the Walker tables prob and inx for calls to random().
        The weights (a list or tuple or iterable) can be in any order;
        they need not sum to 1.
    """
    n = self.n = len(weights)
    self.keys = keys
    sumw = sum(weights)
    prob = [w * n / sumw for w in weights]  # av 1
    inx = [-1] * n
    short = [j for j, p in enumerate( prob ) if p < 1]
    long = [j for j, p in enumerate( prob ) if p > 1]
    while short and long:
        j = short.pop()
        k = long[-1]
        # assert prob[j] <= 1 <= prob[k]
        inx[j] = k
        prob[k] -= (1 - prob[j])  # -= residual weight
        if prob[k] < 1:
            short.append( k )
            long.pop()
    self.prob = prob
    self.inx = inx

  def __str__( self ):
    """ e.g. "Walkerrandom prob: 0.4 0.8 1 0.8  inx: 3 3 -1 2" """
    probstr = " ".join([ "%.2g" % x for x in self.prob ])
    inxstr = " ".join([ "%d" % x for x in self.inx ])  # %d: "%.2g" would print 100 as 1e+02
    return "Walkerrandom prob: %s  inx: %s" % (probstr, inxstr)

#...............................................................................
  def random( self ):
    """ each call -> a random int or key with the given probability
        fast: 1 randint(), 1 random.uniform(), table lookup
    """
    u = random.uniform( 0, 1 )
    j = random.randint( 0, self.n - 1 )  # or low bits of u
    idx = j if u <= self.prob[j] \
        else self.inx[j]
    # with no keys, return the index itself, as the docstring promises
    return idx if self.keys is None else list(self.keys)[idx]
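
To see what the two tables encode, the following self-contained sketch rebuilds them for four weights and then recovers the probability of each index from the tables alone. The helper names `build_alias_tables` and `implied_probabilities` are illustrative and not part of the original recipe; the construction loop mirrors the one in `__init__`, with `Fraction` used here only to keep the arithmetic exact.

```python
from fractions import Fraction

def build_alias_tables(weights):
    """Return (prob, inx), built the same way as in the Walkerrandom constructor."""
    n = len(weights)
    total = sum(weights)
    # Scale the weights so that the average entry is exactly 1.
    prob = [Fraction(w) * n / total for w in weights]
    inx = [-1] * n
    short = [j for j, p in enumerate(prob) if p < 1]
    long = [j for j, p in enumerate(prob) if p > 1]
    while short and long:
        j = short.pop()
        k = long[-1]
        inx[j] = k               # bucket j is topped up by donor k
        prob[k] -= 1 - prob[j]   # the donor gives away the missing mass
        if prob[k] < 1:
            short.append(k)
            long.pop()
    return prob, inx

def implied_probabilities(prob, inx):
    """Probability of drawing each index, derived from the tables alone."""
    n = len(prob)
    out = [Fraction(0)] * n
    for j in range(n):
        out[j] += prob[j] / n                  # bucket j returns its own index
        if inx[j] >= 0:
            out[inx[j]] += (1 - prob[j]) / n   # otherwise it returns its alias
    return out

weights = [1, 2, 3, 4]
prob, inx = build_alias_tables(weights)
implied = implied_probabilities(prob, inx)
print([float(p) for p in prob], inx)
print([float(p) for p in implied])
```

For the weights 1, 2, 3, 4 this produces the tables quoted in the `__str__` docstring (prob 0.4 0.8 1 0.8, inx 3 3 -1 2), and the implied probabilities are 0.1, 0.2, 0.3 and 0.4, i.e. the normalised weights.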

## Synthetic countries population

Data for country populations taken from:<br>
https://data.worldbank.org/indicator/SP.POP.TOTL


Some data preparation was done in Excel, as most of it was easiest to do visually:
- removal of region totals
- removal of first two rows containing metadata
- removal of all year fields apart from 2019
- removal of Eritrea, as it only contained data up to 2011
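
The same steps could probably be scripted in pandas instead of Excel. The sketch below is an assumption about how that might look, not what was actually run: it uses a tiny made-up stand-in for the downloaded file, the country names, codes and numbers are placeholders, and `aggregate_codes` is a hypothetical list that would need to contain every region total present in the real file.

```python
import io
import pandas as pd

# Toy stand-in for the downloaded file: two metadata rows, then the table.
# All names and numbers are made-up placeholders, not World Bank figures.
raw_csv = io.StringIO(
    "Data Source,placeholder\n"
    "Last Updated Date,placeholder\n"
    "Country Name,Country Code,2018,2019\n"
    "Country A,AAA,90,100\n"
    "Country B,BBB,280,300\n"
    "Eritrea,ERI,,\n"
    "World,WLD,370,400\n"
)

# Hypothetical list of region-total codes; the real file has many more.
aggregate_codes = {"WLD"}

raw = pd.read_csv(raw_csv, skiprows=2)                            # skip the metadata rows
countries = raw[~raw["Country Code"].isin(aggregate_codes)]       # drop region totals
countries = countries[["Country Name", "Country Code", "2019"]]   # keep only 2019
countries = countries.dropna(subset=["2019"])                     # drop rows with no 2019 value
countries = countries.reset_index(drop=True)
print(countries)
```

Dropping rows with a missing 2019 value removes Eritrea in this toy example without naming it explicitly; whether that matches the manual clean-up exactly depends on the real file.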

In [3]:
# import data into pandas
import pandas as pd
input = pd.read_csv('./population.csv')
input = input.rename(columns={"Country Name": "Country_name", "Country Code": "Country_code", "2019": "Real_pop"})
input.head()

Unnamed: 0,Country_name,Country_code,Real_pop
0,China,CHN,1397715000
1,India,IND,1366417754
2,United States,USA,328239523
3,Indonesia,IDN,270625568
4,Pakistan,PAK,216565318


In [4]:
# sum the total population
real_total = input['Real_pop'].sum()
print(real_total)

7656340905


In [5]:
# add a column with each country's share of the total population (a fraction, not a percentage)
input['Real_perc'] = input['Real_pop']/real_total
input.head()

Unnamed: 0,Country_name,Country_code,Real_pop,Real_perc
0,China,CHN,1397715000,0.182557
1,India,IND,1366417754,0.178469
2,United States,USA,328239523,0.042872
3,Indonesia,IDN,270625568,0.035347
4,Pakistan,PAK,216565318,0.028286


Now I can use this data to generate synthetic records with Walker's Alias method.

In [6]:
# number of records to generate
# the same number as real records could be used
#Nrand = real_total
# or a smaller value, e.g. 1 million
Nrand = 1000000


from datetime import datetime
import time as t
start_ts = t.time()
now_start = datetime.now()
print(f"Starting generation on {now_start.strftime('%d/%m/%Y %H:%M:%S')}")
print()

print(Nrand, "Walkerrandom countries")

# set up the 'buckets' for walker's alias method
wrand = Walkerrandom(input['Real_pop'], input['Country_code'])
from collections import defaultdict
nrand = defaultdict(int)
# sample randomly from the distribution
for _ in range(Nrand):
  j = wrand.random()
  # sum the records for each country
  # here we could be creating a record for each generated 'person'
  nrand[j] += 1
  
# print the totals per country
s = str(sorted(nrand.items()))
print()
now_end = datetime.now()
print(f"Generation finished at {now_end.strftime('%d/%m/%Y %H:%M:%S')}")
print(f"Running time: {(t.time()-start_ts)/60} minutes")
print(s)


Starting generation on 22/05/2021 20:59:09

1000000 Walkerrandom countries

Generation finished at 22/05/2021 20:59:32
Running time: 0.37904158035914104 minutes
[('ABW', 12), ('AFG', 4948), ('AGO', 4160), ('ALB', 351), ('AND', 10), ('ARE', 1201), ('ARG', 5818), ('ARM', 382), ('ASM', 10), ('ATG', 13), ('AUS', 3285), ('AUT', 1130), ('AZE', 1240), ('BDI', 1546), ('BEL', 1476), ('BEN', 1496), ('BFA', 2725), ('BGD', 21372), ('BGR', 911), ('BHR', 219), ('BHS', 39), ('BIH', 444), ('BLR', 1207), ('BLZ', 46), ('BMU', 12), ('BOL', 1530), ('BRA', 27382), ('BRB', 40), ('BRN', 62), ('BTN', 107), ('BWA', 295), ('CAF', 612), ('CAN', 4984), ('CHE', 1103), ('CHI', 29), ('CHL', 2447), ('CHN', 182905), ('CIV', 3389), ('CMR', 3486), ('COD', 11463), ('COG', 765), ('COL', 6626), ('COM', 123), ('CPV', 73), ('CRI', 687), ('CSS', 997), ('CUB', 1480), ('CUW', 22), ('CYM', 12), ('CYP', 155), ('CZE', 1378), ('DEU', 10984), ('DJI', 139), ('DMA', 7), ('DNK', 749), ('DOM', 1411), ('DZA', 5586), ('ECU', 2246), ('EGY'

In [7]:
df = pd.DataFrame.from_dict(nrand, orient='index', columns=['Synth_pop'])
df['Country_code'] = df.index
output = pd.merge(df,input,left_on=['Country_code'], right_on = ['Country_code'], how = 'left')
synth_total = output['Synth_pop'].sum()
output['Synth_perc'] = output['Synth_pop']/synth_total
output['Diff'] = abs(output['Real_perc']-output['Synth_perc'])
output.head()

Unnamed: 0,Synth_pop,Country_code,Country_name,Real_pop,Real_perc,Synth_perc,Diff
0,178343,IND,India,1366417754,0.178469,0.178343,0.000126
1,14081,PHL,Philippines,108116615,0.014121,0.014081,4e-05
2,4984,CAN,Canada,37589262,0.00491,0.004984,7.4e-05
3,3486,CMR,Cameroon,25876380,0.00338,0.003486,0.000106
4,12505,VNM,Vietnam,96462106,0.012599,0.012505,9.4e-05


In [8]:
print(f"Smallest difference {output['Diff'].min()}")
print(f"Biggest difference {output['Diff'].max()}")
print(f"Average difference {output['Diff'].mean()}")

Smallest difference 7.2851075588307e-08
Biggest difference 0.000446408064269703
Average difference 3.3005597386725265e-05
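
As a rough plausibility check on these differences: if the draws are independent, the sampled share of a country with true share p over N draws has standard deviation sqrt(p(1-p)/N). The sketch below evaluates this for the two largest `Real_perc` values shown earlier; the function name `share_std_error` is illustrative.

```python
import math

def share_std_error(p, n_draws):
    """Standard deviation of a sampled share under independent draws (binomial model)."""
    return math.sqrt(p * (1 - p) / n_draws)

n_draws = 1_000_000  # the Nrand value used in the run above
real_shares = {"CHN": 0.182557, "IND": 0.178469}  # Real_perc values from the table

for code, p in real_shares.items():
    print(code, f"{share_std_error(p, n_draws):.6f}")
```

Both values come out at roughly 0.0004, so the largest reported difference (about 0.00045) is of the size one would expect from sampling noise alone, and India's observed difference of 0.000126 is well within one standard deviation.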
