# Population Weighting

This notebook extends the travel time sampling by weighting each SED (State Electoral Division) by its population.

## Setup

In [1]:
import numpy as np
import pandas as pd
import os

In [2]:
os.chdir("../submission/")

The travel time data:

In [3]:
travel_time_data = pd.read_pickle("data/preprocessed/travel_times_edit.pkl")

In [4]:
travel_time_data[:1]

Unnamed: 0.1,Unnamed: 0,closest_destination,closest_time_to_arrival,origin_latitude,origin_longitude,second_closest_destination,second_closest_time_to_arrival,sed_id,sed_name,type,time
0,0,The Alfred,498.0,-37.83541,144.969078,St Vincent's Hospital,1227.0,SED20106,Albert Park (Southern Metropolitan),metropolitan,09:00


The travel time data subset by SED:

In [5]:
len(travel_time_data) / 50

88.0

In [6]:
len(travel_time_data[travel_time_data.sed_id == "SED20106"])

50
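
The same check can be run for every SED at once rather than one ID at a time. Below is a minimal sketch, assuming a small made-up frame (`toy_travel_times`, with 3 rows per SED instead of 50, and invented times apart from the first) standing in for `travel_time_data`:

```python
import pandas as pd

# Hypothetical stand-in for travel_time_data: 3 rows per SED instead of 50,
# with made-up travel times.
toy_travel_times = pd.DataFrame({
    "sed_id": ["SED20106"] * 3 + ["SED20207"] * 3,
    "closest_time_to_arrival": [498.0, 510.0, 620.0, 700.0, 655.0, 810.0],
})

# Count the rows recorded for each SED.
rows_per_sed = toy_travel_times.groupby("sed_id").size()

# True only if every SED has the same number of rows.
all_equal = rows_per_sed.nunique() == 1
print(rows_per_sed.to_dict(), all_equal)
```

On the real data, the analogous call would be `travel_time_data.groupby("sed_id").size()`, which should report 50 for each of the 88 SEDs if the subset is balanced.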

Ok, now we just need to work on selecting the SED by population weighting.

## Data Source

Data was sourced from the [ABS Census Data Packs (2016)](https://datapacks.censusdata.abs.gov.au/datapacks/).

The data selected was the General Community Profile data for State Electoral Divisions (as was used to generate location data from Google Maps).

We only require one of the data files, `2016Census_G01_VIC_SED.csv`, which contains the field `Tot_P_P` i.e. the total population of the SED. This data file has been moved directly to the `submission/data/external/` directory.

In [7]:
population_data = pd.read_csv("data/external/2016Census_G01_VIC_SED.csv")

In [8]:
population_data

Unnamed: 0,SED_CODE_2016,Tot_P_M,Tot_P_F,Tot_P_P,Age_0_4_yr_M,Age_0_4_yr_F,Age_0_4_yr_P,Age_5_14_yr_M,Age_5_14_yr_F,Age_5_14_yr_P,...,High_yr_schl_comp_Yr_8_belw_P,High_yr_schl_comp_D_n_g_sch_M,High_yr_schl_comp_D_n_g_sch_F,High_yr_schl_comp_D_n_g_sch_P,Count_psns_occ_priv_dwgs_M,Count_psns_occ_priv_dwgs_F,Count_psns_occ_priv_dwgs_P,Count_Persons_other_dwgs_M,Count_Persons_other_dwgs_F,Count_Persons_other_dwgs_P
0,SED20106,37542,38414,75951,1680,1698,3380,2197,2120,4314,...,1566,145,211,355,31866,33295,65160,7385,6939,14328
1,SED20207,45554,46137,91695,4267,4009,8276,6682,6268,12955,...,3170,298,410,705,42931,43556,86484,2099,2134,4234
2,SED20302,35313,37386,72701,2720,2560,5277,4992,4864,9859,...,3434,128,142,273,32152,33998,66153,2417,2776,5196
3,SED20401,30328,31860,62184,2138,2063,4205,3621,3388,7013,...,2269,153,178,335,28310,29481,57789,1476,1879,3353
4,SED20508,28565,30653,59222,1759,1596,3357,3768,3605,7376,...,2496,77,56,130,25700,27668,53366,1648,1911,3560
5,SED20604,31681,32532,64222,2025,1963,3994,4244,4147,8392,...,2851,108,104,207,27951,29492,57438,3045,2440,5485
6,SED20704,31097,33147,64246,2149,2061,4211,4556,4127,8687,...,2806,115,116,225,28338,30061,58398,2117,2551,4665
7,SED20804,29430,30686,60122,1792,1762,3554,3602,3365,6966,...,3125,146,123,269,26033,27771,53797,3049,2564,5609
8,SED20906,29585,31447,61033,1986,1864,3853,4377,4023,8399,...,2263,134,196,334,27913,29647,57563,1217,1403,2619
9,SED21001,33317,36028,69349,1804,1751,3555,4080,3867,7945,...,2035,216,311,531,31059,33547,64604,1828,2290,4117
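
Since only `SED_CODE_2016` and `Tot_P_P` are needed, the remaining columns could be skipped at load time. The sketch below illustrates this with `usecols`; the in-memory `csv_text` buffer is an assumption standing in for the real CSV file, and holds three rows copied from the table above with most columns dropped.

```python
import io
import pandas as pd

# Three rows copied from the table above, reduced to a few columns;
# the in-memory buffer stands in for the CSV file on disk.
csv_text = """SED_CODE_2016,Tot_P_M,Tot_P_F,Tot_P_P
SED20106,37542,38414,75951
SED20207,45554,46137,91695
SED20302,35313,37386,72701
"""

# usecols keeps only the SED code and the total population column.
slim_population = pd.read_csv(
    io.StringIO(csv_text), usecols=["SED_CODE_2016", "Tot_P_P"]
)
print(slim_population)
```

The rest of this notebook keeps the full frame, which works just as well; trimming the columns only makes the displayed output easier to read.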


It turns out we can do this with `numpy` quite easily...

In [9]:
total_population = sum(population_data["Tot_P_P"])

In [10]:
weighted_proportions = population_data["Tot_P_P"] / total_population

In [11]:
weighted_proportions[:5]

0    0.012815
1    0.015472
2    0.012267
3    0.010492
4    0.009993
Name: Tot_P_P, dtype: float64

In [12]:
sed_id_list = population_data["SED_CODE_2016"]

In [13]:
np.random.choice(sed_id_list, p=weighted_proportions)

'SED20401'

Great! So we can just add these lines to the `travel_times` function.

Let's just do a sanity check to make sure that SEDs with higher populations are getting selected more often...

In [14]:
population_data[population_data["SED_CODE_2016"] ==  "SED20106"].Tot_P_P.values[0]

75951

In [15]:
for i in range(0, 100):
    sed = np.random.choice(sed_id_list, p=weighted_proportions)
    print("SED {}, TOTAL POPULATION {}"\
          .format(sed,
                  population_data[population_data["SED_CODE_2016"]==sed].Tot_P_P.values[0]))

SED SED24904, TOTAL POPULATION 63296
SED SED26504, TOTAL POPULATION 56972
SED SED26708, TOTAL POPULATION 59969
SED SED27504, TOTAL POPULATION 69152
SED SED22802, TOTAL POPULATION 57261
SED SED23101, TOTAL POPULATION 59610
SED SED22305, TOTAL POPULATION 74906
SED SED21001, TOTAL POPULATION 69349
SED SED25205, TOTAL POPULATION 62960
SED SED23308, TOTAL POPULATION 62886
SED SED25003, TOTAL POPULATION 68583
SED SED28803, TOTAL POPULATION 89162
SED SED20604, TOTAL POPULATION 64222
SED SED26504, TOTAL POPULATION 56972
SED SED24308, TOTAL POPULATION 65358
SED SED22105, TOTAL POPULATION 97040
SED SED26406, TOTAL POPULATION 71764
SED SED27807, TOTAL POPULATION 75221
SED SED25905, TOTAL POPULATION 66222
SED SED22607, TOTAL POPULATION 67008
SED SED25802, TOTAL POPULATION 66601
SED SED23602, TOTAL POPULATION 56943
SED SED26303, TOTAL POPULATION 64788
SED SED21906, TOTAL POPULATION 65552
SED SED27608, TOTAL POPULATION 69402
SED SED25205, TOTAL POPULATION 62960
SED SED21106, TOTAL POPULATION 62498
...

Okay, it definitely looks alright! At the very least, we can see that the two "SEDs" with 29 and 7565 inhabitants (in reality, entries kept for documentation purposes that don't correspond to an actual area) don't occur once, whereas you'd expect to see them in an equally weighted sample of 100 SEDs.
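
Eyeballing 100 printed draws is a fairly loose check. A more quantitative version would draw many samples and compare the observed selection frequencies with the target weights. The sketch below does this under a few assumptions: the `populations` array is a hypothetical five-entry stand-in (four values taken from the table above plus one 29-person pseudo-SED), and it uses a seeded `np.random.default_rng` generator rather than the `np.random.choice` call used in this notebook, purely so the result is repeatable.

```python
import numpy as np

# Hypothetical populations: four ordinary SEDs (values from the table above)
# plus one tiny pseudo-SED with 29 inhabitants.
populations = np.array([75951, 91695, 72701, 62184, 29])
weights = populations / populations.sum()

# Seeded generator so the check is repeatable.
rng = np.random.default_rng(0)
draws = rng.choice(len(populations), size=100_000, p=weights)

# Empirical selection frequency of each SED.
observed = np.bincount(draws, minlength=len(populations)) / draws.size

# Largest gap between observed frequencies and the target weights.
max_gap = np.abs(observed - weights).max()
print(max_gap)
```

If the weighting is applied correctly, `max_gap` should shrink towards zero as the number of draws grows, and the pseudo-SED should be selected only very rarely.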

We can just directly wrap this into the `travel_times.py` file now (excluding the last two SEDs, as there is no location data for them, given they're not locations!).

> This notebook code has now been integrated into `travel_times.py`.
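
For readers without access to `travel_times.py`, the general idea can be sketched as follows. This is not the code in that file: `sample_weighted_row`, the toy frames, and the SED codes `SED_A`, `SED_B` and `SED_PSEUDO` are all hypothetical, and the populations and times are made up. The sketch recomputes the weights from whatever rows it is given, so they still sum to 1 after the pseudo-SEDs are sliced off.

```python
import numpy as np
import pandas as pd

def sample_weighted_row(travel_df, population_df, rng):
    """Pick an SED in proportion to its population, then one of its rows.

    Hypothetical helper for illustration; not the code in travel_times.py.
    """
    # Recompute the weights from the rows passed in, so they sum to 1
    # even after pseudo-SEDs have been sliced off.
    weights = population_df["Tot_P_P"] / population_df["Tot_P_P"].sum()
    sed = rng.choice(population_df["SED_CODE_2016"].to_numpy(),
                     p=weights.to_numpy())
    # Sample uniformly among the rows recorded for the chosen SED.
    candidates = travel_df[travel_df["sed_id"] == sed]
    return candidates.iloc[rng.integers(len(candidates))]

# Made-up data: two real-looking SEDs and one pseudo-SED in the last row.
toy_population = pd.DataFrame({
    "SED_CODE_2016": ["SED_A", "SED_B", "SED_PSEUDO"],
    "Tot_P_P": [60000, 40000, 29],
})
toy_travel = pd.DataFrame({
    "sed_id": ["SED_A", "SED_A", "SED_B", "SED_B"],
    "closest_time_to_arrival": [500.0, 650.0, 1200.0, 900.0],
})

rng = np.random.default_rng(0)
# Drop the trailing pseudo-SED before sampling, as described above.
row = sample_weighted_row(toy_travel, toy_population[:-1], rng)
print(row["sed_id"], row["closest_time_to_arrival"])
```

Slicing the population frame before computing the weights matters here: `np.random.choice` and `Generator.choice` both raise an error if the probabilities passed to `p` do not sum to 1.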

Now that we've integrated it into `travel_times.py`, let's just test it out.

In [16]:
from modules.travel_times import travel_times

In [18]:
n_rural = 0
n_metro = 0
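# Draw 1000 weighted samples; index 4 of each result is compared against "rural".
results = [travel_times(travel_time_data, population_data[:-2]) for _ in range(1000)]
for result in results:
    if result[4] == "rural":
        n_rural += 1
    else:
        n_metro += 1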

print("{} RURAL, {} METRO".format(n_rural, n_metro))

351 RURAL, 649 METRO
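
One way to judge whether a split like this is plausible would be to compare it against the share of the total population living in rural SEDs, since that is the expected share of rural draws under population weighting. The sketch below shows the calculation on made-up data: the SED codes, populations and classifications in `toy_population` and `toy_types` are hypothetical and are not taken from the census file.

```python
import pandas as pd

# Hypothetical populations and classifications for four made-up SEDs.
toy_population = pd.DataFrame({
    "SED_CODE_2016": ["SED_A", "SED_B", "SED_C", "SED_D"],
    "Tot_P_P": [60000, 70000, 50000, 20000],
})
toy_types = pd.DataFrame({
    "sed_id": ["SED_A", "SED_B", "SED_C", "SED_D"],
    "type": ["metropolitan", "metropolitan", "rural", "rural"],
})

# Attach each SED's type to its population.
merged = toy_population.merge(
    toy_types, left_on="SED_CODE_2016", right_on="sed_id"
)

# Share of the total population living in each type of SED; under
# population weighting this is the expected share of draws of that type.
expected_share = (
    merged.groupby("type")["Tot_P_P"].sum() / merged["Tot_P_P"].sum()
)
print(expected_share)
```

On the real data, the type of each SED could presumably be taken from the `sed_id` and `type` columns of `travel_time_data` (with duplicates dropped), and the resulting rural share compared with the observed 351 out of 1000.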
