
<h3>CITS2402 Lab 3</h3>

# Exploring the Census Age Data

<div>
    <img src="barking-board-tiny.png" width=500>
    <br><br>
</div>

<sup>Image: https://www.abc.net.au/news/2017-08-03/census-2016-ordinary-australia-probably-isnt-where-you-think/8680052</sup>

In *The Census and the "Typical" Australian?*, we discussed the age chosen for Clare (38), the "Typical" Australian, and Paul (37), the "Typical" Western Australian.

We saw that the ABS chose to use the *mode* for most attributes, while for age they chose to use the *median*, and we asked why they may have made that decision. (Somewhat confusingly, their Quickstats also report some *means* (averages) for the same attributes.)

In this lab we'll explore the data, and see if you think that choice is justified.


## Data Acquisition

(Note: Finding and extracting the data _IS part of the exercise_ - if you just get the data from someone else you will be missing the opportunity to do a typical data exploration. If the next part doesn't make sense, you may want to revisit the case study in the lectures first.)

* Find and download the 2016 Census *General Community Profile* datapack from the ABS to your computer, and unzip the package.

#### Readme files

By convention, authors of data commonly put explanations about the data in one or more plain text files called `readme.txt` or similar (so they're readable from the shell without an editor).

* Have a scan through the available readme files starting with `AboutDatapacks_readme.txt`.

You can come back to these if you need more information on understanding the data.

#### Metadata

_Metadata_ (meaning, roughly, "above the data") is additional information to the raw data. In this case it tells you what tables of data (stored as separate files) there are in the package, and the descriptors used in those tables.

* Open the Metadata spreadsheet for the DataPack (in your preferred spreadsheet software). 
* Find the table that reports age data from the Census. (Hint: This is the one that treats Age as a _dependent variable_, rather than an _independent variable_.)

* Note the table number, and go to the Cell Descriptors for that data. 

You should find age categories starting from the age of zero.

>  _Q: How many categories does a 20 year old identifying as male appear in?_
>
> _Q: What happens when you get to 80?_

You should see that people over 80 do not get the same representation as those under 80. Can you think of any good reason to justify this blatant ageism? 

#### Data files

* Open the data directory from the package and find the data files corresponding to the tables.

* Find the files corresponding to the Age data.

What format are the data stored in?

Check the size of the files. You should find they are around 4KB and 2KB.

* Upload the relevant data files to the directory for this Lab in CoCalc. (The two files should each be less than 5KB. Do **not** upload the entire 6.5MB directory.)

Knowing a few Unix commands (as mentioned in the Getting Started lab sheet) provides a very quick way of finding your way around directories and data.

* Open a shell ("Linux Terminal" ) in CoCalc.
  * Use `ls -l` to list the contents of your directory. You should see the uploaded files with their sizes (in number of bytes) listed on the left side of the dates.
  * Use `more` _filename_ to show the contents of one of the data files. (Tip: you can use Tab for filename completion, up to the point where two filenames differ.)

Is it what you would have expected? Is there extra information such as headers, footers, or metadata that you will need to watch out for? What would be your strategy for getting all the data into a usable form in Python?

### Reading the Data into the Notebook

* Set up constants `AGE_DATA_A` and `AGE_DATA_B` to point to the age data files. Read in and print out the data to ensure you have accessed it correctly.

You should see the same thing you saw using the Unix `more` command.

In [6]:
AGE_DATA_A="2016Census_G04A_AUS.csv"
AGE_DATA_B="2016Census_G04B_AUS.csv"

In [7]:
AandB=[AGE_DATA_A,AGE_DATA_B]
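
The read-and-print step could look something like the following minimal sketch. The helper name `show_file` and the tiny sample file are made up for illustration so that the block runs on its own; in the lab you would pass `AGE_DATA_A` and `AGE_DATA_B` instead.

```python
import os
import tempfile

def show_file(filename):
    """Print the raw contents of a text file and return them as a list of lines."""
    with open(filename, 'r') as data:
        lines = data.read().splitlines()
    for line in lines:
        print(line)
    return lines

# Made-up sample with one header row and one data row (not real census figures)
sample = "CODE,Age_yr_0_M,Age_yr_0_F,Age_yr_0_P\nX,3,4,7\n"
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write(sample)

lines = show_file(tmp.name)
os.remove(tmp.name)   # tidy up the temporary sample file
```

Returning the lines as well as printing them makes it easy to compare what Python read with what `more` showed in the shell.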

## Data Cleaning

We are going to examine the age profile of the *population as a whole*.

Have another look at the relevant Cell Descriptors in the metadata spreadsheet. Which rows of data do you need for this task?

Let's start by pulling the data into lists and cleaning it to leave just the data we need.



#### 1. Checked solution [1 prac mark]

* Write a function `read_data(files)` that
  * takes as input a list of file names containing the raw data
  * reads and cleans the data so that it contains only data for the 1 year age categories for persons, the 5 year cumulative categories for persons and the final 100+ category (in the same order as the original file), and the corresponding numbers
  * tidies up the category names by removing the `_M`, `_F` and `_P` suffixes
  * returns a pair `(categories, numbers)` containing a list of the categories (as strings), and a list of the corresponding numbers (as integers)
  

An example output might be:

```
(categories, numbers) = read_data([AGE_DATA_A, AGE_DATA_B])
print(categories[:6])
print(numbers[:6])

['Age_yr_0', 'Age_yr_1', 'Age_yr_2', 'Age_yr_3', 'Age_yr_4', 'Age_yr_0_4']
[276227, 293503, 295142, 299725, 300184, 1464779]
```

Tip: If you are having trouble, a similar example is covered in the _Census_ case study in the lectures.
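
One possible way to pick out the persons columns is by the *meaning* of the suffix rather than by column position. This is only a sketch: `persons_columns` is a hypothetical helper and the header and row values are made up, although the `_M`, `_F` and `_P` suffix pattern is the one described above.

```python
def persons_columns(header, row):
    """Keep only the '_P' (persons) columns, returning (names, integer values)."""
    names = []
    values = []
    for name, value in zip(header, row):
        # split at the LAST underscore: 'Age_yr_0_P' -> ('Age_yr_0', '_', 'P')
        stem, _, suffix = name.rpartition('_')
        if suffix == 'P':
            names.append(stem)
            values.append(int(value))
    return names, values

# Made-up header and data row in the male / female / persons pattern
header = ['Age_yr_0_M', 'Age_yr_0_F', 'Age_yr_0_P',
          'Age_yr_1_M', 'Age_yr_1_F', 'Age_yr_1_P']
row = ['3', '4', '7', '5', '5', '10']

print(persons_columns(header, row))
```

This only covers the column selection; a full `read_data` would still need to read the files and decide which of the remaining columns to keep.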



In [8]:
def read_data(files):
    categories = []
    numbers = []

    for file in files:
        with open(file, 'r') as data:
            lines = data.read().splitlines()
        header = lines[0].split(',')   # column names
        values = lines[1].split(',')   # the single row of counts
        categories += header[1:]       # drop the region code column
        numbers += values[1:]

    # columns repeat as male, female, persons: take every third (persons) column,
    # then drop the final persons column (the overall total, not an age category)
    persons = slice(2, len(categories), 3)
    categories = categories[persons][:-1]
    numbers = numbers[persons][:-1]

    # strip the '_P' suffix and convert the counts to integers
    categories = [category[:-2] for category in categories]
    numbers = [int(number) for number in numbers]

    return categories, numbers

read_data(AandB)

(['Age_yr_0',
  'Age_yr_1',
  'Age_yr_2',
  'Age_yr_3',
  'Age_yr_4',
  'Age_yr_0_4',
  'Age_yr_5',
  'Age_yr_6',
  'Age_yr_7',
  'Age_yr_8',
  'Age_yr_9',
  'Age_yr_5_9',
  'Age_yr_10',
  'Age_yr_11',
  'Age_yr_12',
  'Age_yr_13',
  'Age_yr_14',
  'Age_yr_10_14',
  'Age_yr_15',
  'Age_yr_16',
  'Age_yr_17',
  'Age_yr_18',
  'Age_yr_19',
  'Age_yr_15_19',
  'Age_yr_20',
  'Age_yr_21',
  'Age_yr_22',
  'Age_yr_23',
  'Age_yr_24',
  'Age_yr_20_24',
  'Age_yr_25',
  'Age_yr_26',
  'Age_yr_27',
  'Age_yr_28',
  'Age_yr_29',
  'Age_yr_25_29',
  'Age_yr_30',
  'Age_yr_31',
  'Age_yr_32',
  'Age_yr_33',
  'Age_yr_34',
  'Age_yr_30_34',
  'Age_yr_35',
  'Age_yr_36',
  'Age_yr_37',
  'Age_yr_38',
  'Age_yr_39',
  'Age_yr_35_39',
  'Age_yr_40',
  'Age_yr_41',
  'Age_yr_42',
  'Age_yr_43',
  'Age_yr_44',
  'Age_yr_40_44',
  'Age_yr_45',
  'Age_yr_46',
  'Age_yr_47',
  'Age_yr_48',
  'Age_yr_49',
  'Age_yr_45_49',
  'Age_yr_50',
  'Age_yr_51',
  'Age_yr_52',
  'Age_yr_53',
  'Age_yr_54',
  'Age_yr

In [21]:
from nose.tools import assert_equal
AGE_DATA_A = '2016Census_G04A_AUS.csv'
AGE_DATA_B = '2016Census_G04B_AUS.csv'
(categories, numbers) = read_data([AGE_DATA_A,AGE_DATA_B])
assert_equal(categories[0],'Age_yr_0')
assert_equal(categories[-1],'Age_yr_100_yr_over')
assert_equal(numbers[-1],3569)


In [10]:
read_data(AandB)[0]

['Age_yr_0',
 'Age_yr_1',
 'Age_yr_2',
 'Age_yr_3',
 'Age_yr_4',
 'Age_yr_0_4',
 'Age_yr_5',
 'Age_yr_6',
 'Age_yr_7',
 'Age_yr_8',
 'Age_yr_9',
 'Age_yr_5_9',
 'Age_yr_10',
 'Age_yr_11',
 'Age_yr_12',
 'Age_yr_13',
 'Age_yr_14',
 'Age_yr_10_14',
 'Age_yr_15',
 'Age_yr_16',
 'Age_yr_17',
 'Age_yr_18',
 'Age_yr_19',
 'Age_yr_15_19',
 'Age_yr_20',
 'Age_yr_21',
 'Age_yr_22',
 'Age_yr_23',
 'Age_yr_24',
 'Age_yr_20_24',
 'Age_yr_25',
 'Age_yr_26',
 'Age_yr_27',
 'Age_yr_28',
 'Age_yr_29',
 'Age_yr_25_29',
 'Age_yr_30',
 'Age_yr_31',
 'Age_yr_32',
 'Age_yr_33',
 'Age_yr_34',
 'Age_yr_30_34',
 'Age_yr_35',
 'Age_yr_36',
 'Age_yr_37',
 'Age_yr_38',
 'Age_yr_39',
 'Age_yr_35_39',
 'Age_yr_40',
 'Age_yr_41',
 'Age_yr_42',
 'Age_yr_43',
 'Age_yr_44',
 'Age_yr_40_44',
 'Age_yr_45',
 'Age_yr_46',
 'Age_yr_47',
 'Age_yr_48',
 'Age_yr_49',
 'Age_yr_45_49',
 'Age_yr_50',
 'Age_yr_51',
 'Age_yr_52',
 'Age_yr_53',
 'Age_yr_54',
 'Age_yr_50_54',
 'Age_yr_55',
 'Age_yr_56',
 'Age_yr_57',
 'Age_yr_58',
 

### Data Augmentation

Later we are going to plot the data. First, however, as the ABS have inexplicably left out the single-year figures for people aged 80 and over, we will have to reconstruct the data as best we can. This is akin to the *Statistical Imputation* we discuss in the *Census* case study (where we discuss Hot Decking, Mean Substitution, Probabilistic and Regression methods).

We'll need to fill in the data for each year group. We'll start with a simple method, and move on to a more challenging one.

As a first approximation we will simply use an averaging approach. We'll make an *assumption* that the number of people in an age range is approximately the same for each year in that category (with excess people added "from the left").

#### 2. Checked Solution [1 prac mark]

* Write a function `spread(age_cat, num)` that:
  * takes as arguments an age range string `age_cat` and a non-negative integer `num`
    * where `age_cat` represents an age range in the format `Age_yr_`*n1_n2*, where *n1* <= *n2* are _any valid human ages_ (eg. `Age_yr_80_84`)
  * returns a pair (2-tuple) of lists:
    * the first list should contain strings in the form `Age_yr_`*n* where *n* ranges from the first to the last year in the range (eg. `Age_yr_80`, `Age_yr_81`,... in the above example)
    * the second list contains a list of integers adding up to `num` such that:
      * if `num` is divisible by the number of years in the range, then each integer should be the same (ie. the people are evenly spread)
      * if there is a remainder from dividing `num` by the number of years, the remainder should be added one at a time to the year groups starting at the lowest age (80 in this example) until there are no more remaining

For example, `spread('Age_yr_80_84', 7)` should return:

```
(['Age_yr_80', 'Age_yr_81', 'Age_yr_82', 'Age_yr_83', 'Age_yr_84'], [2, 2, 1, 1, 1])
```

Tip: You might find `divmod` useful. Also, recall that `+` and `*` can be used to create lists.

> _Good Programming Tip_: When parsing strings (for example) it is better where possible (more robust, more readable, more maintainable) to use components that have some "meaning" to break down objects, rather than numerical indices.
>
> For example, in _DiversityInStudy_ most of the units were in the form `STAT2023-2`. To get the teaching period, we could use something like `unit[9]`. However, later in the file there were some non-standard semesters, such as `SVLG1225-Y3`, which would break this code.
>
> Instead we could use something like `(code,_,period) = unit.partition("-")` which recognises the logical _role_ played by the dash - left of the dash means one thing, right of the dash another. 
>
> This code will work equally well for  `SVLG1225-Y3` because it is based on the roles of the components, rather than their exact numerical position.
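
The difference between the two approaches in the tip can be seen in a short sketch. The helper name `split_unit` is made up for illustration; the two unit strings are the ones quoted above.

```python
def split_unit(unit):
    """Split a unit string such as 'STAT2023-2' into (code, period) at the dash."""
    code, _, period = unit.partition('-')
    return code, period

print(split_unit('STAT2023-2'))     # ('STAT2023', '2')
print(split_unit('SVLG1225-Y3'))    # ('SVLG1225', 'Y3')

# The index-based version silently loses part of the non-standard period
print('SVLG1225-Y3'[9])             # 'Y'
```

The same idea applies to the age categories: splitting on the underscores says "these are the fields of the name", whereas a fixed index breaks as soon as an age has a different number of digits.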



In [11]:
divmod(7,5)

(1, 2)

In [12]:
7%5

2

In [13]:
def spread(age_cat, num):
    # the last two underscore-separated fields are the first and last ages,
    # e.g. 'Age_yr_80_84' -> 80 and 84 (converted to int so they compare as numbers)
    first, last = (int(field) for field in age_cat.split('_')[-2:])
    years = last - first + 1

    individual_age = ['Age_yr_' + str(age) for age in range(first, last + 1)]

    # share num evenly, then hand out the remainder one at a time from the lowest age
    share, remain = divmod(num, years)
    individual_num = [share + 1] * remain + [share] * (years - remain)

    return (individual_age, individual_num)


spread('Age_yr_80_84',7)


spread('Age_yr_80_84',7)

(['Age_yr_80', 'Age_yr_81', 'Age_yr_82', 'Age_yr_83', 'Age_yr_84'],
 [2, 2, 1, 1, 1])

In [14]:
from nose.tools import assert_equal
assert_equal(spread('Age_yr_80_84', 7), (['Age_yr_80', 'Age_yr_81', 'Age_yr_82', 'Age_yr_83', 'Age_yr_84'], [2, 2, 1, 1, 1]))
assert_equal(spread('Age_yr_80_84', 460549), (['Age_yr_80', 'Age_yr_81', 'Age_yr_82', 'Age_yr_83', 'Age_yr_84'], [92110, 92110, 92110, 92110, 92109]))
assert_equal(spread('Age_yr_80_89', 7), (['Age_yr_80', 'Age_yr_81', 'Age_yr_82', 'Age_yr_83', 'Age_yr_84', 'Age_yr_85', 'Age_yr_86', 'Age_yr_87', 'Age_yr_88', 'Age_yr_89'], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]))
assert_equal(spread('Age_yr_59_62', 0), (['Age_yr_59', 'Age_yr_60', 'Age_yr_61', 'Age_yr_62'], [0, 0, 0, 0]))
print("So far, so good. Additional test cases will be applied.")


So far, so good. Additional test cases will be applied.


* Using your `spread` function, write a function `augment_data(categories, numbers)` that completes (reconstructs) the lists of age categories and numbers, following the pattern of the lower age years, as follows:
  * for the age ranges from 80-84 to 95-99, insert the categories and numbers for each year calculated by spreading the total number over the 5 years
  * for the age range 100 years and over, assume the ages are spread over the range 100-104 (again a poor assumption) and insert as above

For example, your output for the list of categories (age ranges) should end like this:
```
...'Age_yr_98', 'Age_yr_99', 'Age_yr_95_99', 'Age_yr_100', 'Age_yr_101', 'Age_yr_102', 'Age_yr_103', 'Age_yr_104', 'Age_yr_100_yr_over']
```

* Combine your list of categories and your list of numbers into a list of pairs containing *only entries for each single year*.
* In the process also strip out "`yr_`" from the category names.

Tip: Use `zip` to combine the lists.

Your list should now start like this:
```
[('Age_0', 276227),
 ('Age_1', 293503),
 ('Age_2', 295142),
 ('Age_3', 299725),
 ('Age_4', 300184),
 ('Age_5', 298271),
 ('Age_6', 302901),
 ('Age_7', 299413),
 ...
```

* Check your list has the number of entries that you would expect.
* As usual, check your work with unit tests.
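
The `zip` tip and the single-year filter can be illustrated on a small made-up sample. This is a sketch only: `single_years` is a hypothetical helper name, the numbers are invented, and it assumes (as in the category names above) that a single-year name has exactly one field after `Age_yr_`.

```python
def single_years(categories, numbers):
    """Pair categories with numbers, keep single-year entries, drop 'yr_' from the names."""
    pairs = []
    for category, number in zip(categories, numbers):
        # 'Age_yr_7' -> ['7'];  'Age_yr_0_4' -> ['0', '4'];
        # 'Age_yr_100_yr_over' -> ['100', 'yr', 'over']
        fields = category.split('_')[2:]
        if len(fields) == 1:
            pairs.append((category.replace('yr_', ''), number))
    return pairs

# Made-up short sample in the same shape as the augmented lists
categories = ['Age_yr_0', 'Age_yr_1', 'Age_yr_0_1', 'Age_yr_100', 'Age_yr_100_yr_over']
numbers = [4, 6, 10, 2, 2]

print(single_years(categories, numbers))
```

On the full augmented data, one quick sanity check is that the numbers in the resulting pairs add up to the same total as the range categories they replace.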


#### 3. Checked Solution [1 prac mark]

- Drawing from your work so far, define a function `cleaned_data(files)` that:
  - takes a list of the csv file names and reads in the age data from those files
  - augments the data with the missing age data as described above
  - keeps only the necessary data (number of persons for each year)
  - shortens the category names to the form `Age_`_n_
  - returns the data in a list of pairs in the form (_category\_name_, _number_)

Check that your output is as you would expect, before applying the sample tests.

Note that your function can call functions you have defined previously. As usual, however, check that it works with a fresh kernel.



In [19]:
def cleaned_data(file_list):
    categories, numbers = read_data(file_list)

    cleaned = []
    for category, number in zip(categories, numbers):
        # fields after 'Age_yr_': ['0'] for a single year, ['80', '84'] for a
        # range, ['100', 'yr', 'over'] for the open-ended final category
        fields = category.split('_')[2:]

        if category == 'Age_yr_100_yr_over':
            # assume the 100+ group is spread over the ages 100-104
            years, counts = spread('Age_yr_100_104', number)
        elif len(fields) == 1:
            # single-year category: keep it as it is
            years, counts = [category], [number]
        elif int(fields[0]) >= 80:
            # 5 year range with no single-year data: spread it evenly
            years, counts = spread(category, number)
        else:
            # 5 year total under 80: the single years are already present
            continue

        # shorten 'Age_yr_n' to 'Age_n' and store (category, number) pairs
        for year, count in zip(years, counts):
            cleaned.append((year.replace('yr_', ''), count))

    return cleaned

In [0]:
from nose.tools import assert_equal
AGE_DATA_A = '2016Census_G04A_AUS.csv'
AGE_DATA_B = '2016Census_G04B_AUS.csv'
squeaky = cleaned_data([AGE_DATA_A, AGE_DATA_B])
assert_equal(squeaky[0], ('Age_0', 276227))
assert_equal(squeaky[80], ('Age_80', 92110))
assert_equal(squeaky[-1][0], 'Age_104')
assert_equal(len(squeaky), 105)
print("So far, so good. Additional test cases will be applied.")


## Data Analysis

#### 4. Checked Solution [1 prac mark]

For the last checked solution of this lab, you are given less "scaffolding" and it is up to you to decide how to best break down the problem.

* Without using any libraries, write a function `central_measures(clean_data)` that takes a list of `(age_year, number)` pairs, and returns a triple `(mean, median, mode)` containing the _measures of central tendency_ for age in years.
  * Where the _mean_ falls between two integers, your function should round to the nearest year. Where it is exactly half way, it should choose the even year (following standard practice). For example, if the calculated mean is 51.5, the function would return 52.
  * Where the _median_ falls between two age groups, it should return the higher age group. For example, if the population consisted only of 100 people aged 20, and 100 people aged 21, the function would return 21 (since the ages will be clustered around 21.0).
  * If there is more than one _mode_, your function should return the lowest age mode.

* You should endeavour to make your function "space efficient". For example, rather than calculating the median by creating a (very long) list of all the people, think about using a count of people in each age group to find the median.

Test your function thoroughly. Some example (unit) tests are included below. Do you agree with the answers given?
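
One possible shape for such a function is sketched below, as an illustration rather than a model answer. The name `central_measures_sketch` is made up, and the sketch assumes a non-empty population and category names that end in the age (for example `Age_37`). It uses only built-ins, and works from the counts rather than building a list of all the people.

```python
def central_measures_sketch(clean_data):
    """Return (mean, median, mode) of age in years from (category, count) pairs."""
    # 'Age_37' -> 37: the text after the last underscore is the age
    ages = [(int(name.rpartition('_')[2]), count) for name, count in clean_data]
    ages.sort()
    total = sum(count for _, count in ages)

    # mean: Python's round() sends exact halves to the even integer
    mean = round(sum(age * count for age, count in ages) / total)

    # median: the first age at which the running count passes half the population
    # (using > rather than >= picks the higher age when the halfway point
    # falls between two age groups)
    running = 0
    for age, count in ages:
        running += count
        if 2 * running > total:
            median = age
            break

    # mode: the largest count, taking the lowest age when counts are tied
    mode = min(ages, key=lambda pair: (-pair[1], pair[0]))[0]

    return (mean, median, mode)

print(central_measures_sketch([('Age_1', 1), ('Age_2', 1)]))
print(central_measures_sketch([('Age_20', 100), ('Age_21', 100)]))
```

The second call is the example from the median rule above: the median comes out as 21, while the mean of 20.5 rounds to the even year 20.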


In [0]:
def central_measures (clean_data):
    # YOUR CODE HERE
    raise NotImplementedError()


In [0]:
from nose.tools import assert_equal
AGE_DATA_A = '2016Census_G04A_AUS.csv'
AGE_DATA_B = '2016Census_G04B_AUS.csv'
datalist = [AGE_DATA_A, AGE_DATA_B]
assert_equal(central_measures(cleaned_data(datalist)[:1]), (0,0,0))
assert_equal(central_measures(cleaned_data(datalist)[:2]), (1,1,1))
assert_equal(central_measures(cleaned_data(datalist)[:4]), (2,2,3))
assert_equal(central_measures([('Age_1', 1), ('Age_2', 1)]), (2,2,1))
print("So far, so good.")


Notice that in generating the characterisation of the "typical Australians" the ABS would have had to make similar assumptions about filling in the missing data as we have had to here.

_Q: What did you find for the measures of central tendency for your data as a whole? How do they compare with the "typical" Australians?_

_Which of these statistics could depend on the assumptions we made when filling in the missing data?_

## Data Visualisation

For this section you may wish to refer to examples in the lecture case studies. You are also expected to refer to the Python API documentation for the details of functions. In this case, this includes
[matplotlib.pyplot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html). Note that for this lab sheet you do _not_ need to import `numpy` as shown in the API documentation - we will come to numpy later.

* Use `pyplot` to produce a basic vertical bar chart for your data.

How would you improve your chart to better understand the data from the plot?

* Improve your plot to make it 'production-ready':
  * choose an appropriate width for the plot that allows individual years to be better distinguished
  * place the ticks at 5 year intervals along the x-axis
  * label the ticks, with the labels written vertically
  * label your axes and give the chart a title
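
The improvements listed above might be put together along the following lines. This is a sketch under assumptions: the helper names `five_year_ticks` and `plot_age_profile`, the figure size, and the label and title wording are all placeholders, and the sample data is made up.

```python
def five_year_ticks(data):
    """Return (positions, labels) for every fifth entry of a list of (category, number) pairs."""
    positions = list(range(0, len(data), 5))
    labels = [data[i][0] for i in positions]
    return positions, labels

def plot_age_profile(data):
    """Draw a vertical bar chart of (category, number) pairs with a tick every 5 years."""
    # imported inside the function so five_year_ticks can be tried without matplotlib
    import matplotlib.pyplot as plt

    numbers = [number for _, number in data]
    positions, labels = five_year_ticks(data)

    plt.figure(figsize=(20, 6))                     # wide enough to separate single years
    plt.bar(range(len(data)), numbers)
    plt.xticks(positions, labels, rotation='vertical')
    plt.xlabel('Age (years)')
    plt.ylabel('Number of persons')
    plt.title('Population by age')
    plt.show()

# Made-up data: 12 single-year entries with arbitrary counts
sample = [('Age_' + str(age), 100 - age) for age in range(12)]
print(five_year_ticks(sample))
```

Calling `plot_age_profile` on the output of your `cleaned_data` function would then draw the chart for the census data.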

The bottom of your plot should now look like this:

<div><br><img src="bottom-of-plot.png" width=1000></div>

_Q: What can you learn from the plot? What do you make of the ABS' assertion that the 'typical Australians' are in the 36-38 year range? Looking at the distribution, what age would you choose?_

Q: Peter Costello (see the _Census_ case study) introduced the 'baby bonus' around 2002, and made his plea to parents in 2004. The census was taken in 2016. Do you see any (circumstantial) evidence that parents may have heeded his plea? (Bearing in mind, of course, that correlation does not imply causation.)

## Challenge

We filled in the missing data by effectively averaging across each age range (in a similar spirit to _Mean Substitution_). This hasn't turned out too badly, but we could do better.

* While staying true to the total number in each age range, improve your method for filling in the missing data to make it as realistic as you can. Remember that you would like to try to use a *general* approach that would also work on other sets of data, rather than one that is *specific* to this set of data.

Which of the measures of central tendency could be affected by this? Check whether the measures have changed.
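
As one possible starting point (not the only approach), the even spread can be generalised to a *weighted* spread that still adds up to the total for the range. The sketch below uses the largest remainder method to turn proportional shares into whole numbers. The name `spread_weighted` is made up, the weights are assumed to be positive integers, and how to choose realistic weights is deliberately left open.

```python
def spread_weighted(num, weights):
    """Split the integer num into integers proportional to weights, summing to num."""
    total_weight = sum(weights)
    # whole-number share for each year, rounded down
    shares = [num * w // total_weight for w in weights]
    # what each year missed out on through rounding down
    remainders = [num * w % total_weight for w in weights]
    leftover = num - sum(shares)
    # give one extra person to the years with the largest remainders,
    # preferring the lowest age when remainders are tied
    order = sorted(range(len(weights)), key=lambda i: (-remainders[i], i))
    for i in order[:leftover]:
        shares[i] += 1
    return shares

print(spread_weighted(7, [1, 1, 1, 1, 1]))      # equal weights
print(spread_weighted(100, [5, 4, 3, 2, 1]))    # made-up declining weights
```

With equal weights this reproduces the even spread from earlier (`[2, 2, 1, 1, 1]` for 7 people over 5 years), so it can be checked against the existing `spread` tests before trying other weights.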

### Extra curricular (not examinable) - for those who have studied Data Structures and Algorithms or similar

What is the time complexity of your functions?

- A: In general?
- B: For a human population?

Can they be improved?

## Extra Practice: More Babies

The _Census_ case study includes an exercise to change the line plot so that it uses the percentage on the y-axis rather than the raw numbers, and to see if there is evidence there for the Treasurer's baby boom.

* Go ahead and complete the exercise if you haven't done so already.
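
Converting raw numbers to percentages of the total could be sketched as follows. The helper name `as_percentages` and the sample counts are made up for illustration; the plotting itself is left to the exercise.

```python
def as_percentages(data):
    """Convert (category, number) pairs to (category, percentage of the total) pairs."""
    total = sum(number for _, number in data)
    return [(category, 100 * number / total) for category, number in data]

sample = [('Age_0', 1), ('Age_1', 3)]   # made-up counts
print(as_percentages(sample))
```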

&copy; Cara MacNish, UWA