# Data transformation

This Jupyter notebook contains open-source code for transforming education and employment data. The dataset, “Education and Employment, ASGS and LGA, 2011, 2014-19”, comes from the “1410.0 - Data by Region, 2014-19” collection [1] published by the Australian Bureau of Statistics (ABS). The collection contains information about education and employment in Australia between 2014 and 2019. The ABS (https://www.abs.gov.au) is Australia's national statistical agency and provides a wide range of statistical data collections on economic, population, environmental, and social issues.

In [1]:
# import python libraries

import pandas as pd

In [17]:
input_file = '14100do0005_2014-19.xlsx'
output_file = 'transformed_data.csv' 

In [18]:
# read data file

df = pd.read_excel(input_file, sheet_name = 'Table 1', skiprows=6, header=0, index_col=0)
df = df.dropna(subset=['Year'], how='all') # drop rows where 'Year' is null
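
As a rough illustration of what the `dropna` call does, the sketch below applies it to a small made-up DataFrame. The toy values and the "Footnote text" row are assumptions for demonstration only. They are not taken from the ABS spreadsheet, whose exact layout is not shown here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the spreadsheet: made-up values, not ABS data.
toy = pd.DataFrame(
    {
        "Label": ["Australia", "Victoria", "Footnote text"],
        "Year": [2016, 2016, np.nan],
    },
    index=["0", "2", "note"],
)

# Keep only the rows that have a value in the 'Year' column.
cleaned = toy.dropna(subset=["Year"])
print(cleaned)
```

Rows such as trailing notes, which carry no year, would be removed this way while the data rows are kept.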

In [14]:
selected_columns = [
    'Label', 
    'Year',
    'Completed Year 8 or below (%)',
    'Completed Year 9 or equivalent (%)',
    'Completed Year 12 or equivalent (%)',
    'Completed Year 11 or equivalent (%)',
    'Completed Year 10 or equivalent (%)',
    'Did not go to school (%)',
    'Highest Year of School Completed - Not stated (%)',
    ]
selected_rows = [
    'Australia',
    'New South Wales', 
    'Victoria',
    'Queensland',
    'South Australia',
    'Western Australia',
    'Tasmania',
    'Northern Territory',
    'Australian Capital Territory'
]
selected_year = 2016

In [15]:
# extract data from file

df = df[selected_columns] # extract data by columns
row_mask = df['Label'].isin(selected_rows) & (df['Year'] == selected_year) # extract data by rows and year
extracted_df = df[row_mask].copy()

extracted_df['Year'] = extracted_df['Year'].astype(int) # convert data type to int
extracted_df = extracted_df.drop_duplicates(subset=['Label', 'Year']) # remove duplicates
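
To illustrate the row-and-year selection on its own, here is a minimal sketch using a small made-up DataFrame. The values, the `toy` frame, and the names `wanted_labels` and `wanted_year` are hypothetical and only mirror the shape of the real data.

```python
import pandas as pd

# Made-up values for illustration; not ABS figures.
toy = pd.DataFrame(
    {
        "Label": ["Australia", "Victoria", "Victoria", "Sydney"],
        "Year": [2016.0, 2016.0, 2011.0, 2016.0],
        "Did not go to school (%)": [0.8, 0.9, 0.7, 1.0],
    }
)
wanted_labels = ["Australia", "Victoria"]  # hypothetical selection
wanted_year = 2016

# Build a boolean mask: label is in the list AND year matches.
mask = toy["Label"].isin(wanted_labels) & (toy["Year"] == wanted_year)
subset = toy.loc[mask].copy()

# Same clean-up steps as in the notebook cell.
subset["Year"] = subset["Year"].astype(int)
subset = subset.drop_duplicates(subset=["Label", "Year"])
print(subset)
```

Only the 2016 rows for "Australia" and "Victoria" remain. The 2011 row and the "Sydney" row are filtered out.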

In [19]:
# save transformed data file

extracted_df.to_csv(output_file, sep=',', encoding='utf-8', index=False)
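
One way to check the saved output is to read it back and compare it with what was written. The sketch below does this with an in-memory buffer and a made-up two-row DataFrame, so it does not touch `transformed_data.csv`. The `toy` frame and the `buffer` name are assumptions for illustration.

```python
import io

import pandas as pd

# Made-up frame standing in for the extracted data.
toy = pd.DataFrame({"Label": ["Australia", "Victoria"], "Year": [2016, 2016]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, sep=",", index=False)

# Rewind and read the CSV text back into a new DataFrame.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```

Because `index=False` is used, the row index is not written, so the reloaded frame has only the `Label` and `Year` columns.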

## References

[1] Australian Bureau of Statistics, 1410.0 - Data by Region, 2014-19, Population and People, ASGS and LGA, 2011, 2014-2019, Australian Bureau of Statistics, 2020. [Dataset]. Available: https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/1410.02014-19?OpenDocument. [Accessed: January 4, 2021].