We want to estimate, in a Bayesian way, the likelihood of a person becoming personally insolvent, given certain information about them.
The important information we're given about individuals in non-compliance-in-personal-insolvencies.csv is:
- SA3 of debtor
- Sex of debtor
- Family situation
- Debtor occupation code (N.B., these seem to be Sub-Major Groups in the ANZSCO ontology, see http://www.abs.gov.au/ANZSCO)
Because we don't have the joint distribution of debtor occupation and family situation, we can't do this with a single model. Instead, we'll have to construct two models:
- Estimating Pr(non-compliance) given SA3, sex, and family situation
- Estimating Pr(non-compliance) given SA3, sex, and debtor occupation
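One simple way to make either model Bayesian is a Beta-Binomial estimate per cell. The sketch below is an illustration under assumptions, not necessarily the model used here: it assumes the census supplies the population count for a cell (for example one SA3, sex, and family situation) and the insolvency file supplies the number of non-compliance cases in that cell. The function name `posterior_mean` and the uniform Beta(1, 1) prior are hypothetical choices.

```python
def posterior_mean(non_compliant, population, alpha=1.0, beta=1.0):
    """Posterior mean of Pr(non-compliance) for one cell.

    Assumes a Beta(alpha, beta) prior and a binomial likelihood with
    `non_compliant` cases observed among `population` people.
    """
    if non_compliant < 0 or population < non_compliant:
        raise ValueError("need 0 <= non_compliant <= population")
    return (alpha + non_compliant) / (alpha + beta + population)


# Hypothetical counts for a single (SA3, sex, family situation) cell.
estimate = posterior_mean(non_compliant=3, population=1000)
print(f"Pr(non-compliance) is roughly {estimate:.4f}")
```

With no data at all the estimate falls back to the prior mean, which keeps sparsely populated SA3s from producing extreme probabilities.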
B25 is the census dataset describing 'Family Composition'.
We need to aggregate B25 to produce the categories in family situation in non-compliance-in-personal-insolvencies.csv:
1. Find all unique family situations in non-compliance-in-personal-insolvencies.csv
2. Express each such family situation in terms of the columns in B25
3. Produce a new version of B25 whose columns are the family situations found in (1)
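The three steps above can be sketched in plain Python as follows. Everything concrete here is an assumption made for illustration: the field name `Family Situation`, the B25 column names, the category labels, and the counts are all hypothetical, and the real mapping has to be worked out by hand from the B25 metadata.

```python
def unique_family_situations(insolvency_rows, field="Family Situation"):
    """Step 1: collect the distinct family situations in the insolvency data."""
    return sorted({row[field] for row in insolvency_rows})


def aggregate_b25(b25_rows, mapping, key="SA3"):
    """Step 3: rebuild B25 with one column per family situation.

    `mapping` (step 2) sends each family situation to the list of B25
    columns whose counts are summed to produce it.
    """
    aggregated = []
    for row in b25_rows:
        new_row = {key: row[key]}
        for situation, columns in mapping.items():
            new_row[situation] = sum(int(row[column]) for column in columns)
        aggregated.append(new_row)
    return aggregated


# Hypothetical inputs; real rows would come from csv.DictReader.
insolvency_rows = [
    {"Family Situation": "Couple with Dependants"},
    {"Family Situation": "Single without Dependants"},
    {"Family Situation": "Couple with Dependants"},
]
b25_rows = [
    {"SA3": "10101", "couple_kids_u15": "40", "couple_kids_15plus": "10", "lone_person": "25"},
]
# Step 2: hypothetical mapping from family situation to B25 columns.
mapping = {
    "Couple with Dependants": ["couple_kids_u15", "couple_kids_15plus"],
    "Single without Dependants": ["lone_person"],
}

situations = unique_family_situations(insolvency_rows)
missing = [s for s in situations if s not in mapping]  # situations still unmapped
new_b25 = aggregate_b25(b25_rows, mapping)
```

Checking `missing` before aggregating is a cheap way to catch a family situation that has no B25 expression yet.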
The debtor occupation codes in non-compliance-in-personal-insolvencies.csv are Sub-Major Groups in the ANZSCO ontology, see http://www.abs.gov.au/ANZSCO. However, the census data only has Major Groups, so our model can only operate on ANZSCO Major Groups. The relevant census datasets are either B45 Occupation by Age by Sex (in which case age needs to be marginalised out) or T34 Occupation by Sex (which is time-series data for each census year). We ended up using B57A.
- Add a column associating each Sub-Major Group in non-compliance-in-personal-insolvencies.csv with its parent ANZSCO Major Group
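In ANZSCO, a Sub-Major Group has a two-digit code whose first digit is the code of its parent Major Group, so the new column can be derived from the code itself. The sketch below assumes the occupation code is stored as a two-digit value; the field names `Debtor Occupation Code` and `ANZSCO Major Group` and the helper names are hypothetical.

```python
# ANZSCO Major Groups, keyed by their one-digit code.
ANZSCO_MAJOR_GROUPS = {
    1: "Managers",
    2: "Professionals",
    3: "Technicians and Trades Workers",
    4: "Community and Personal Service Workers",
    5: "Clerical and Administrative Workers",
    6: "Sales Workers",
    7: "Machinery Operators and Drivers",
    8: "Labourers",
}


def major_group(sub_major_code):
    """Return the Major Group code for a two-digit Sub-Major Group code."""
    code = str(sub_major_code).strip()
    if len(code) != 2 or not code.isdigit():
        raise ValueError(f"not a two-digit Sub-Major Group code: {sub_major_code!r}")
    major = int(code[0])
    if major not in ANZSCO_MAJOR_GROUPS:
        raise ValueError(f"unknown Major Group in code: {sub_major_code!r}")
    return major


def add_major_group(rows, code_field="Debtor Occupation Code",
                    out_field="ANZSCO Major Group"):
    """Return copies of `rows` with the parent Major Group code added."""
    return [{**row, out_field: major_group(row[code_field])} for row in rows]


# Hypothetical rows; real rows would come from csv.DictReader.
rows = add_major_group([
    {"Debtor Occupation Code": "23"},
    {"Debtor Occupation Code": "89"},
])
```

Rows with a missing or unrecognised code raise a `ValueError` here; depending on the data, it may be preferable to set them aside rather than stop.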
https://data.gov.au/dataset/non-compliance-personal-insolvencies
Raw data:
data/afsa/
└── non-compliance-in-personal-insolvencies.csv