# Cleaning Employee Exit Surveys

In this lab, you'll clean exit surveys from employees of the [Department of Education, Training and Employment (DETE)](https://en.wikipedia.org/wiki/Department_of_Education_and_Training_(Queensland)) and the Technical and Further Education (TAFE) body of the Queensland government in Australia. The TAFE exit survey can be found [here](https://data.gov.au/dataset/ds-qld-89970a3b-182b-41ea-aea2-6f9f17b5907e/details?q=exit%20survey) and the DETE exit survey can be found [here](https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey).

Complete the tasks listed below. You can submit the completed lab until 11:59 PM.

<u>Requirement:</u><br>
Do your best to write idiomatic (Pythonic) code rather than traditional programming-style code.

<u>Hint:</u><br>
For all of these tasks, you will first need to read the data using __pd.read_csv__.

### Task 1 (2 marks)

Import the necessary libraries, read the data files __dete_survey.csv__ and __tafe_survey.csv__, and ensure that the displayed columns are not truncated (by default, pandas shows only the first few and last few columns, separated by an ellipsis).

<u>Hint:</u> pd.options...

Once you have done that, output some information about your columns, such as how many columns there are and the data types of the columns.

In [33]:
### Write your code below this comment.
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
dete = pd.read_csv('data/dete_survey.csv')
tafe = pd.read_csv('data/tafe_survey.csv')
print('There are', dete.shape[0], 'rows and', dete.shape[1], 'columns in the DETE dataframe. \n')
print('Column Names and Types in DETE:\n', dete.dtypes)

There are 822 rows and 56 columns in the DETE dataframe. 

Column Names and Types in DETE:
 ID                                      int64
SeparationType                         object
Cease Date                             object
DETE Start Date                        object
Role Start Date                        object
Position                               object
Classification                         object
Region                                 object
Business Unit                          object
Employment Status                      object
Career move to public sector             bool
Career move to private sector            bool
Interpersonal conflicts                  bool
Job dissatisfaction                      bool
Dissatisfaction with the department      bool
Physical work environment                bool
Lack of recognition                      bool
Lack of job security                     bool
Work location                            bool
Employment conditions             

In [32]:
print('There are', tafe.shape[0], 'rows and', tafe.shape[1], 'columns in the TAFE dataframe. \n')
print('Column Names and Types in TAFE:\n', tafe.dtypes)

There are 702 rows and 72 columns in the TAFE dataframe. 

Column Names and Types in TAFE:
 Record ID                                                                                                                                                        float64
Institute                                                                                                                                                         object
WorkArea                                                                                                                                                          object
CESSATION YEAR                                                                                                                                                   float64
Reason for ceasing employment                                                                                                                                     object
Contributing Factors. Career Move - Public Sector              
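
As a compact alternative to printing the shape and dtypes separately, `DataFrame.info()` reports column names, non-null counts, and dtypes in one call. The following is a minimal sketch on a small made-up frame; the name `toy` and its values are illustrative assumptions, not taken from the survey files.

```python
import pandas as pd

# Show every column instead of eliding the middle ones with "..."
pd.set_option('display.max_columns', None)

# Hypothetical toy data standing in for a survey file
toy = pd.DataFrame({
    'ID': [1, 2, 3],
    'SeparationType': ['Age Retirement', 'Termination', 'Other'],
    'Ill Health': [False, True, False],
})

n_rows, n_cols = toy.shape
print(f'There are {n_rows} rows and {n_cols} columns.')
print(toy.dtypes)  # one dtype per column
toy.info()         # column names, non-null counts, and dtypes together
```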

### Task 2 (1 mark)

You should be able to make the following observation based on the work you did above in Task 1:
  - The __dete_survey__ dataframe contains `'Not Stated'` values that indicate values are missing, but they aren't represented as `NaN`.
  
Read the dataset __dete_survey.csv__ again such that the `'Not Stated'` values are replaced with NaN ensuring that the mechanism for representing missing values is consistent across the dataset.

<u>Hint:</u> Use the __na_values__ argument in __pd.read_csv__

In [35]:
### Write your code below this comment.
dete = pd.read_csv('data/dete_survey.csv', na_values ='Not Stated')

Unnamed: 0,ID,SeparationType,Cease Date,DETE Start Date,Role Start Date,Position,Classification,Region,Business Unit,Employment Status,Career move to public sector,Career move to private sector,Interpersonal conflicts,Job dissatisfaction,Dissatisfaction with the department,Physical work environment,Lack of recognition,Lack of job security,Work location,Employment conditions,Maternity/family,Relocation,Study/Travel,Ill Health,Traumatic incident,Work life balance,Workload,None of the above,Professional Development,Opportunities for promotion,Staff morale,Workplace issue,Physical environment,Worklife balance,Stress and pressure support,Performance of supervisor,Peer support,Initiative,Skills,Coach,Career Aspirations,Feedback,Further PD,Communication,My say,Information,Kept informed,Wellness programs,Health & Safety,Gender,Age,Aboriginal,Torres Strait,South Sea,Disability,NESB
0,1,Ill Health Retirement,08/2012,1984.0,2004.0,Public Servant,A01-A04,Central Office,Corporate Strategy and Peformance,Permanent Full-time,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,True,A,A,N,N,N,A,A,A,A,N,N,N,A,A,A,N,A,A,N,N,N,Male,56-60,,,,,Yes
1,2,Voluntary Early Retirement (VER),08/2012,,,Public Servant,AO5-AO7,Central Office,Corporate Strategy and Peformance,Permanent Full-time,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,A,A,N,N,N,N,A,A,A,N,N,N,A,A,A,N,A,A,N,N,N,Male,56-60,,,,,
2,3,Voluntary Early Retirement (VER),05/2012,2011.0,2011.0,Schools Officer,,Central Office,Education Queensland,Permanent Full-time,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,A,A,N,N,N,N,Male,61 or older,,,,,
3,4,Resignation-Other reasons,05/2012,2005.0,2006.0,Teacher,Primary,Central Queensland,,Permanent Full-time,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,A,N,N,N,A,A,N,N,A,A,A,A,A,A,A,A,A,A,A,N,A,Female,36-40,,,,,
4,5,Age Retirement,05/2012,1970.0,1989.0,Head of Curriculum/Head of Special Education,,South East,,Permanent Full-time,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,A,A,N,N,D,D,N,A,A,A,A,A,A,SA,SA,D,D,A,N,A,M,Female,61 or older,,,,,
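
To see what `na_values` changes, the sketch below reads the same text twice, once without and once with the argument. The inline CSV is a hypothetical stand-in for __dete_survey.csv__, kept tiny so the effect is easy to check.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for dete_survey.csv
csv_text = """ID,Cease Date,Region
1,08/2012,Central Office
2,Not Stated,Not Stated
3,05/2012,South East
"""

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), na_values='Not Stated')

print(raw.isna().sum().sum())    # 0: 'Not Stated' is read as an ordinary string
print(clean.isna().sum().sum())  # 2: both 'Not Stated' cells are now NaN
```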


### Task 3 (1 mark)

Both the __dete_survey__ and __tafe_survey__ contain many columns that we don't need to complete our analysis.

Drop the columns at positions 28-48 (zero-based, inclusive of both ends) in __dete_survey__ as well as the columns at positions 17-65 in __tafe_survey__, and save the resulting dataframes as __dete_survey_updated__ and __tafe_survey_updated__ respectively. Also confirm that the new dataframes have 21 and 49 fewer columns, respectively, than the original dataframes.

In [64]:
### Write your code below this comment.
dete_survey_updated = dete.drop(dete.columns[28:49], axis=1)
tafe_survey_updated = tafe.drop(tafe.columns[17:66], axis=1)
print("The number of columns is less by", len(dete.columns)-len(dete_survey_updated.columns), 
      "in the new DETE dataframe. \n")
print("The number of columns is less by", len(tafe.columns)-len(tafe_survey_updated.columns), 
      "in the new TAFE dataframe.")

The number of columns is less by 21 in the new DETE dataframe. 

The number of columns is less by 49 in the new TAFE dataframe.
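
Because Python slices exclude their end index, an inclusive range of positions needs an end that is one larger. A minimal sketch, assuming a hypothetical six-column frame named `df`:

```python
import pandas as pd

# Hypothetical frame with six columns named c0..c5
df = pd.DataFrame([[0, 1, 2, 3, 4, 5]], columns=[f'c{i}' for i in range(6)])

# Positions 2-4 inclusive: the slice end is exclusive, so write 2:5
trimmed = df.drop(columns=df.columns[2:5])

print(list(trimmed.columns))           # ['c0', 'c1', 'c5']
print(df.shape[1] - trimmed.shape[1])  # 3
```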


### Task 4 (2 marks)

Rename all the columns in the dataframe __dete_survey_updated__ such that the following requirements are satisfied:

1. The column names are in lower case
2. Any leading and trailing spaces are removed from column names
3. Any space in column names is replaced with an underscore ( _ )

Also, rename the columns in the dataframe __tafe_survey_updated__ such that they match the names in __dete_survey_updated__. You can use the following dictionary for mapping:

    {'Record ID': 'id', 'CESSATION YEAR': 'cease_date', 'Reason for ceasing employment': 'separationtype', 
    'Gender. What is your Gender?': 'gender', 'CurrentAge. Current Age': 'age', 
    'Employment Type. Employment Type': 'employment_status', 'Classification. Classification': 'position',
    'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service', 
    'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}

In [65]:
### Write your code below this comment.
dete_survey_updated.columns = dete_survey_updated.columns.str.strip().str.lower().str.replace(" ", "_")
tafe_survey_updated.rename(columns={
    'Record ID': 'id', 'CESSATION YEAR': 'cease_date', 'Reason for ceasing employment': 'separationtype',
    'Gender. What is your Gender?': 'gender', 'CurrentAge. Current Age': 'age',
    'Employment Type. Employment Type': 'employment_status', 'Classification. Classification': 'position',
    'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
    'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service',
}, inplace=True)
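
One way to check that the two naming schemes line up is to compare the resulting column sets. The sketch below uses two hypothetical empty frames, `dete_like` and `tafe_like`, whose headers only imitate the real ones.

```python
import pandas as pd

# Hypothetical headers imitating the two surveys
dete_like = pd.DataFrame(columns=[' Cease Date', 'DETE Start Date ', 'Gender'])
tafe_like = pd.DataFrame(columns=['CESSATION YEAR', 'Gender. What is your Gender?', 'Institute'])

# Strip outer whitespace, lower-case, then replace inner spaces with underscores
dete_like.columns = dete_like.columns.str.strip().str.lower().str.replace(' ', '_', regex=False)

# rename() without inplace returns a new frame; unmapped columns keep their names
tafe_like = tafe_like.rename(columns={
    'CESSATION YEAR': 'cease_date',
    'Gender. What is your Gender?': 'gender',
})

shared = sorted(set(dete_like.columns) & set(tafe_like.columns))
print(shared)  # ['cease_date', 'gender']
```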

### Task 5 (2 marks)

Get the number of unique values in the ```'separationtype'``` column of both the dataframes __tafe_survey_updated__ and __dete_survey_updated__. You may notice that there is a single value with the name 'Resignation' in __tafe_survey_updated__ while there are multiple values including the word 'Resignation' in __dete_survey_updated__.

Update the ```'separationtype'``` column in __dete_survey_updated__ such that any value that contains the word 'Resignation' is replaced with 'Resignation'. For example, 'Resignation-Other reasons' should be replaced with 'Resignation' etc.

<u>Hint:</u> All values that include the word 'Resignation' also include a hyphen (-). You can use this fact to quickly update the values.

Once you have done that, create two new dataframes, __dete_resignations__ and __tafe_resignations__, as copies of the rows of __dete_survey_updated__ and __tafe_survey_updated__ respectively in which the ```'separationtype'``` column equals 'Resignation'.

In [66]:
### Write your code below this comment.
print('Unique separation types TAFE: \n', tafe_survey_updated['separationtype'].unique(), '\n')
print('There are', len(tafe_survey_updated['separationtype'].unique()), 
      'unique separation types in the TAFE dataframe. \n')
print('Unique separation types DETE: \n', dete_survey_updated['separationtype'].unique(), '\n')
print('There are', len(dete_survey_updated['separationtype'].unique()), 
      'unique separation types in the DETE dataframe.')

Unique separation types TAFE: 
 ['Contract Expired' 'Retirement' 'Resignation' 'Retrenchment/ Redundancy'
 'Termination' 'Transfer' nan] 

There are 7 unique separation types in the TAFE dataframe. 

Unique separation types DETE: 
 ['Ill Health Retirement' 'Voluntary Early Retirement (VER)'
 'Resignation-Other reasons' 'Age Retirement' 'Resignation-Other employer'
 'Resignation-Move overseas/interstate' 'Other' 'Contract Expired'
 'Termination'] 

There are 9 unique separation types in the DETE dataframe.


In [67]:
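# Collapse every 'Resignation-...' variant into the single value 'Resignation'
dete_survey_updated['separationtype'] = dete_survey_updated['separationtype'].apply(
    lambda x: 'Resignation' if 'Resignation' in x else x)
# .copy() makes the filtered frames independent of their source dataframes
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'] == 'Resignation'].copy()
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation'].copy()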
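
The hyphen hint can also be applied with the vectorized string methods, which leave missing values as `NaN` rather than raising an error. A minimal sketch on a hypothetical frame `toy`, using a few of the DETE values listed above:

```python
import pandas as pd

# A few DETE separation types in a hypothetical toy frame
toy = pd.DataFrame({'separationtype': [
    'Resignation-Other reasons',
    'Resignation-Other employer',
    'Age Retirement',
    'Contract Expired',
]})

# Keep only the text before the first hyphen; values without one are unchanged
toy['separationtype'] = toy['separationtype'].str.split('-').str[0]

# Filter with a boolean mask and take an explicit copy of the matching rows
resignations = toy[toy['separationtype'] == 'Resignation'].copy()

print(toy['separationtype'].nunique())  # 3
print(len(resignations))                # 2
```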

### Task 6 (2 marks)

Show the counts of missing values for all the columns in the __tafe_resignations__ dataframe. 

Next, store the counts of all the unique values (including missing values) in the ```'employment_status'``` column in a variable named __es_cnts__, and show the counts. Then fill the missing values in the ```'employment_status'``` column with the most frequent non-NA value. Finally, show the counts of all the unique values (including missing values) in the ```'employment_status'``` column once again. The column should no longer contain any missing values.

<u>Hint:</u> Use _idxmax()_

In [78]:
### Write your code below this comment.
print('Number of missing values per column in tafe_resignations: \n \n', tafe_resignations.isna().sum(), '\n')

es_cnts = tafe_resignations['employment_status'].value_counts(dropna=False)
print('Unique values in employment_status: \n \n', es_cnts)

Number of missing values per column in tafe_resignations: 
 
 id                                                      0
Institute                                               0
WorkArea                                                0
cease_date                                              5
separationtype                                          0
Contributing Factors. Career Move - Public Sector       8
Contributing Factors. Career Move - Private Sector      8
Contributing Factors. Career Move - Self-employment     8
Contributing Factors. Ill Health                        8
Contributing Factors. Maternity/Family                  8
Contributing Factors. Dissatisfaction                   8
Contributing Factors. Job Dissatisfaction               8
Contributing Factors. Interpersonal Conflict            8
Contributing Factors. Study                             8
Contributing Factors. Travel                            8
Contributing Factors. Other                             8
Contributi

dtype('int64')

In [82]:
# Exclude NaN before idxmax() so the fill value is the most frequent non-NA status
fill_value = es_cnts[es_cnts.index.notna()].idxmax()
# Assign the filled column back instead of calling fillna(inplace=True) on a column selection
tafe_resignations = tafe_resignations.assign(
    employment_status=tafe_resignations['employment_status'].fillna(fill_value))
es_cnts_no_na = tafe_resignations['employment_status'].value_counts(dropna=False)
print('Unique values in employment_status with NA filled: \n \n', es_cnts_no_na)

Unique values in employment_status with NA filled: 
 
 Temporary Full-time    161
Permanent Full-time     98
Contract/casual         29
Temporary Part-time     27
Permanent Part-time     25
Name: employment_status, dtype: int64
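
The `idxmax()` hint works because `value_counts()` returns a Series indexed by the unique values, so the index label of the largest count is the most frequent value. A minimal sketch on a hypothetical Series `status` (the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical employment statuses with two missing entries
status = pd.Series(
    ['Temporary Full-time', 'Permanent Full-time', np.nan,
     'Temporary Full-time', np.nan],
    name='employment_status',
)

counts = status.value_counts()   # dropna=True by default, so NaN is not counted
most_common = counts.idxmax()    # index label with the highest count
filled = status.fillna(most_common)

print(most_common)          # Temporary Full-time
print(filled.isna().sum())  # 0
```

Counting with the default `dropna=True` before calling `idxmax()` guards against the case where missing values happen to be the most frequent entry.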
