#### BNZ Advanced Analytics MLE Technical Exercise - Marcus David Buckland

Code & Responses to the questions found in `Instructions.pdf` below.

In [1]:
import duckdb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)

In [2]:
EMPLOYMENT_DATA_FILENAME = 'business-employment-data-march-2024-quarter.zip'
FINANCIAL_DATA_FILENAME = 'business-financial-data-march-2024.zip'

In [3]:
employment_df = pd.read_csv(f'data/{EMPLOYMENT_DATA_FILENAME}')
employment_df.head(3)

Unnamed: 0,Series_reference,Period,Data_value,Suppressed,STATUS,UNITS,Magnitude,Subject,Group,Series_title_1,Series_title_2,Series_title_3,Series_title_4,Series_title_5
0,BDCQ.SEA1AA,2011.06,80078.0,,F,Number,0,Business Data Collection - BDC,Industry by employment variable,Filled jobs,"Agriculture, Forestry and Fishing",Actual,,
1,BDCQ.SEA1AA,2011.09,78324.0,,F,Number,0,Business Data Collection - BDC,Industry by employment variable,Filled jobs,"Agriculture, Forestry and Fishing",Actual,,
2,BDCQ.SEA1AA,2011.12,85850.0,,F,Number,0,Business Data Collection - BDC,Industry by employment variable,Filled jobs,"Agriculture, Forestry and Fishing",Actual,,


In [4]:
financials_df = pd.read_csv(f'data/{FINANCIAL_DATA_FILENAME}')
financials_df.head(3)

Unnamed: 0,Series_reference,Period,Data_value,Suppressed,STATUS,UNITS,Magnitude,Subject,Group,Series_title_1,Series_title_2,Series_title_3,Series_title_4,Series_title_5
0,BDCQ.SF1AA2CA,2016.06,1116.386,,F,Dollars,6,Business Data Collection - BDC,Industry by financial variable (NZSIOC Level 2),Sales (operating income),Forestry and Logging,Current prices,Unadjusted,
1,BDCQ.SF1AA2CA,2016.09,1070.874,,F,Dollars,6,Business Data Collection - BDC,Industry by financial variable (NZSIOC Level 2),Sales (operating income),Forestry and Logging,Current prices,Unadjusted,
2,BDCQ.SF1AA2CA,2016.12,1054.408,,F,Dollars,6,Business Data Collection - BDC,Industry by financial variable (NZSIOC Level 2),Sales (operating income),Forestry and Logging,Current prices,Unadjusted,


### EDA

___

Before getting underway with the five technical questions, it's prudent to understand both of the datasets that will be utilised in answering them.

The stats.govt.nz website provides the following pages related to these datasets:

1. [Business employment data: March 2024 quarter](https://www.stats.govt.nz/information-releases/business-employment-data-march-2024-quarter/)
1. [Business financial data: March 2024 quarter](https://www.stats.govt.nz/information-releases/business-financial-data-march-2024-quarter/)

<u>The Business employment dataset contains data for 14 tables:</u>

1. Actual employment by industry, filled jobs
1. Seasonally adjusted employment by industry, filled jobs
1. Employment by industry, trend, filled jobs
1. Actual employment by industry, total earnings
1. Actual employment by sex, filled jobs
1. Actual employment by sex, total earnings
1. Actual employment by region, filled jobs
1. Seasonally adjusted employment by region, filled jobs
1. Employment by region, trend, filled jobs
1. Actual employment by region, total earnings
1. Actual employment by territorial authority, filled jobs
1. Actual employment by territorial authority, total earnings
1. Actual employment by age, filled jobs
1. Actual employment by age, total earnings


<u>The Business financial dataset contains data for five tables:</u>

1. Business financial data by industry group - actual sales value
1. Business financial data by industry group - seasonally adjusted sales value
1. Business financial data by industry group - actual purchases value
1. Business financial data by industry group - actual salaries and wages values
1. Business financial data by industry group - actual operating profit values

In [5]:
# EDA of Financials Dataset
financials_df.shape # 7635 rows, 14 columns

(7635, 14)

In [6]:
# What are the column names?
list(financials_df.columns)

['Series_reference',
 'Period',
 'Data_value',
 'Suppressed',
 'STATUS',
 'UNITS',
 'Magnitude',
 'Subject',
 'Group',
 'Series_title_1',
 'Series_title_2',
 'Series_title_3',
 'Series_title_4',
 'Series_title_5']

In [7]:
financials_df['Series_reference'].nunique() # 240 unique series references in the dataset

240

In [8]:
# Let's examine the Series_reference column: pick 10 random rows.
# The frame has the default RangeIndex, so row numbers double as .loc labels.
random_row_numbers = rng.integers(0, len(financials_df), size=10)
series_ref_examples = financials_df.loc[random_row_numbers, 'Series_reference']
series_ref_examples

681     BDCQ.SF1CC3CA
5909     BDCQ.SF3IICA
4997     BDCQ.SF2QQCA
3350     BDCQ.SF1PPCT
3306     BDCQ.SF1PPCS
6555    BDCQ.SF8CC1CA
656     BDCQ.SF1CC2CT
5324    BDCQ.SF3CC2CA
1538     BDCQ.SF1DDCA
719     BDCQ.SF1CC3CS
Name: Series_reference, dtype: object
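One thing to keep in mind: `rng.integers` draws with replacement, so the same row could show up twice in the 10 examples. If that matters, a possible alternative is `Series.sample`, which draws without replacement by default. A minimal sketch on a toy frame (the `REF...` values are hypothetical placeholders, not real series references):

```python
import pandas as pd

# Toy stand-in for financials_df; the reference values are made-up placeholders.
toy_df = pd.DataFrame({"Series_reference": [f"REF{i:03d}" for i in range(50)]})

# Draw 10 distinct rows; random_state makes the draw reproducible.
examples = toy_df["Series_reference"].sample(n=10, random_state=42)
print(examples)
```

On the real data this would be `financials_df['Series_reference'].sample(n=10, random_state=42)`, although it would not reproduce the exact rows shown above.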

The series reference can be decomposed. Let's examine the first example in `series_ref_examples` in more detail:
* `BDCQ.SF1CC3CA`

* The first three chars `BDC` tell us that this dataset is part of the `Business Data Collection` group.
* The fourth char `Q` tells us that this data is collected on a `Quarterly` basis.

Following the full stop:
* There are four categories for the next three chars: `SF1`, `SF2`, `SF3`, and `SF8`. These correspond to the variable in the `Series_title_1` column.

* `SF1`= Sales (operating income)
* `SF2`= Purchases and operating expenditure
* `SF3`= Salaries and wages
* `SF8`= Operating profit
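The mapping above can be sanity-checked by slicing the three chars after `BDCQ.` and counting the distinct `Series_title_1` values per code. A minimal sketch on a toy frame: the references are copied from the sample output above, and the titles are paired with them according to the mapping I've just described (so this illustrates the check, it is not evidence for it). The names `sample` and `variable_code` are my own.

```python
import pandas as pd

# Toy frame: references from the sample output, titles assumed from the mapping above.
sample = pd.DataFrame({
    "Series_reference": ["BDCQ.SF1CC3CA", "BDCQ.SF2QQCA", "BDCQ.SF3IICA", "BDCQ.SF8CC1CA"],
    "Series_title_1": [
        "Sales (operating income)",
        "Purchases and operating expenditure",
        "Salaries and wages",
        "Operating profit",
    ],
})

# "BDCQ." occupies positions 0-4, so the variable code sits at positions 5-7.
sample["variable_code"] = sample["Series_reference"].str[5:8]

# If every code maps to exactly one title, each count here is 1.
titles_per_code = sample.groupby("variable_code")["Series_title_1"].nunique()
print(titles_per_code)
```

Running the same two lines against `financials_df` instead of `sample` would confirm (or refute) the mapping on the full dataset.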

The following two chars correspond to the industry type, although it's important to note that the mapping from char code to industry type is not exactly 1:1. For example:

`AA` maps to: Forestry and Logging; Fishing, Aquaculture and Agriculture, Forestry and Fishing Support Services; Agriculture, Forestry and Fishing \
`BB` is Mining \
`CC` maps to 10 "different" industries in the dataset; however, the common theme among those 10 industries is Manufacturing. \
`DD` is Electricity, Gas, Water and Waste Services. \
`EE` is Construction

The next char further splits the industries into their sub-categories. Not every reference carries this char: `BDCQ.SF3IICA` in the sample above goes straight from the industry code `II` to the final two chars. As mentioned above, `CC` covers Manufacturing, and broken down further we see:

| Industry Code | Industry                                                      |
|---------------|---------------------------------------------------------------|
| CC1           | Food, Beverage and Tobacco Product   Manufacturing            |
| CC2           | Textile, Leather, Clothing and Footwear Manufacturing         |
| CC3           | Wood and Paper Products Manufacturing                         |
| CC4           | Printing                                                      |
| CC5           | Petroleum, Chemical, Polymer and Rubber Product Manufacturing |
| CC6           | Non-Metallic Mineral Product Manufacturing                    |
| CC7           | Metal Product Manufacturing                                   |
| CC8           | Transport Equipment, Machinery and Equipment Manufacturing    |
| CC9           | Furniture and Other Manufacturing                             |
| CCC           | Manufacturing                                                 |

The next char is always `C`. My assumption is that it stands for `Current` or `Current prices`.

Finally, the last char can be one of three values, `A`, `S`, or `T`, which correspond to how the data values may have been adjusted.

`A` = Actual \
`S` = Seasonally adjusted \
`T` = Trend
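
Putting the pieces together, the decomposition can be sketched as a regular expression. This is my reading of the format from the examples above, not an official Stats NZ specification: the group names, the helper `parse_series_reference`, and the optional sub-category char (needed for references such as `BDCQ.SF3IICA`) are all assumptions.

```python
import re

# Assumed layout: BDC + Q + "." + variable + industry + [sub-category] + C + adjustment
SERIES_REF_PATTERN = re.compile(
    r"^(?P<collection>BDC)"          # Business Data Collection
    r"(?P<frequency>Q)\."            # Quarterly, followed by the full stop
    r"(?P<variable>SF\d)"            # e.g. SF1 = Sales (operating income)
    r"(?P<industry>[A-Z]{2})"        # e.g. CC = Manufacturing
    r"(?P<sub_industry>[0-9A-Z])?"   # optional, e.g. 3 = Wood and Paper Products
    r"(?P<prices>C)"                 # always C (assumed: Current prices)
    r"(?P<adjustment>[AST])$"        # Actual, Seasonally adjusted, or Trend
)


def parse_series_reference(ref: str) -> dict:
    """Split a series reference into its assumed components."""
    match = SERIES_REF_PATTERN.match(ref)
    if match is None:
        raise ValueError(f"Unrecognised series reference: {ref!r}")
    return match.groupdict()


print(parse_series_reference("BDCQ.SF1CC3CA"))
print(parse_series_reference("BDCQ.SF3IICA"))
```

For `BDCQ.SF3IICA` the `sub_industry` entry comes back as `None`, since that reference has no sub-category char. Applying the parser across `financials_df['Series_reference'].unique()` would be a quick way to find any references that don't fit this assumed layout.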
