In this guided project, we'll work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. You can find the TAFE exit survey here and the survey for the DETE here. We've made some slight modifications to these datasets to make them easier to work with, including changing the encoding to UTF-8 (the originals are encoded in cp1252).

In this project, we'll play the role of data analyst and pretend our stakeholders want to know the following:

Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

Below is a preview of a couple columns we'll work with from the dete_survey.csv:

ID: An id used to identify the participant of the survey
SeparationType: The reason why the person's employment ended
Cease Date: The year or month the person's employment ended
DETE Start Date: The year the person began employment with the DETE

Below is a preview of a couple columns we'll work with from the tafe_survey.csv:

Record ID: An id used to identify the participant of the survey
Reason for ceasing employment: The reason why the person's employment ended
LengthofServiceOverall. Overall Length of Service at Institute (in years): The length of the person's employment (in years)


In [360]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [361]:
dete_survey = pd.read_csv('dete_survey.csv', na_values='Not Stated')
tafe_survey = pd.read_csv('tafe_survey.csv')

In [362]:
dete_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
ID                                     822 non-null int64
SeparationType                         822 non-null object
Cease Date                             788 non-null object
DETE Start Date                        749 non-null float64
Role Start Date                        724 non-null float64
Position                               817 non-null object
Classification                         455 non-null object
Region                                 717 non-null object
Business Unit                          126 non-null object
Employment Status                      817 non-null object
Career move to public sector           822 non-null bool
Career move to private sector          822 non-null bool
Interpersonal conflicts                822 non-null bool
Job dissatisfaction                    822 non-null bool
Dissatisfaction with the department    822 non-null bool
Physical work envir

In [363]:
dete_survey.head()

Unnamed: 0,ID,SeparationType,Cease Date,DETE Start Date,Role Start Date,Position,Classification,Region,Business Unit,Employment Status,...,Kept informed,Wellness programs,Health & Safety,Gender,Age,Aboriginal,Torres Strait,South Sea,Disability,NESB
0,1,Ill Health Retirement,08/2012,1984.0,2004.0,Public Servant,A01-A04,Central Office,Corporate Strategy and Peformance,Permanent Full-time,...,N,N,N,Male,56-60,,,,,Yes
1,2,Voluntary Early Retirement (VER),08/2012,,,Public Servant,AO5-AO7,Central Office,Corporate Strategy and Peformance,Permanent Full-time,...,N,N,N,Male,56-60,,,,,
2,3,Voluntary Early Retirement (VER),05/2012,2011.0,2011.0,Schools Officer,,Central Office,Education Queensland,Permanent Full-time,...,N,N,N,Male,61 or older,,,,,
3,4,Resignation-Other reasons,05/2012,2005.0,2006.0,Teacher,Primary,Central Queensland,,Permanent Full-time,...,A,N,A,Female,36-40,,,,,
4,5,Age Retirement,05/2012,1970.0,1989.0,Head of Curriculum/Head of Special Education,,South East,,Permanent Full-time,...,N,A,M,Female,61 or older,,,,,


In [364]:
dete_survey.isnull().sum().sort_values(ascending=False)

Torres Strait                          819
South Sea                              815
Aboriginal                             806
Disability                             799
NESB                                   790
Business Unit                          696
Classification                         367
Region                                 105
Role Start Date                         98
Opportunities for promotion             87
Career Aspirations                      76
DETE Start Date                         73
Wellness programs                       56
Coach                                   55
Further PD                              54
Cease Date                              34
Workplace issue                         34
Feedback                                30
Health & Safety                         29
Gender                                  24
Professional Development                14
Stress and pressure support             12
Skills                                  11
Age        

Many columns contain mostly null values, so they can be removed without affecting the analysis results. The columns with the most missing values are:

Torres Strait                          819
South Sea                              815
Aboriginal                             806
Disability                             799
NESB                                   790
Business Unit                          696
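
As a rough sketch of how such sparse columns could be dropped, the snippet below computes the share of missing values per column and removes the columns above a cutoff. The toy data frame and the 0.5 cutoff are assumptions made for illustration; they are not taken from the survey files.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for dete_survey (values are made up for illustration).
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Region": ["Central Office", None, "South East", "Central Office"],
    "Torres Strait": [np.nan, np.nan, np.nan, "Yes"],
    "South Sea": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values in each column.
null_share = toy.isnull().mean()

# Hypothetical cutoff: drop columns where more than half the values are missing.
max_null_share = 0.5
sparse_cols = null_share[null_share > max_null_share].index
trimmed = toy.drop(columns=sparse_cols)

print(list(trimmed.columns))
```

A share-based cutoff is easier to reuse across data frames of different lengths than an absolute count.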


In [365]:
tafe_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 72 columns):
Record ID                                                                                                                                                        702 non-null float64
Institute                                                                                                                                                        702 non-null object
WorkArea                                                                                                                                                         702 non-null object
CESSATION YEAR                                                                                                                                                   695 non-null float64
Reason for ceasing employment                                                                                                                                    701 non-

In [366]:
tafe_survey.head()

Unnamed: 0,Record ID,Institute,WorkArea,CESSATION YEAR,Reason for ceasing employment,Contributing Factors. Career Move - Public Sector,Contributing Factors. Career Move - Private Sector,Contributing Factors. Career Move - Self-employment,Contributing Factors. Ill Health,Contributing Factors. Maternity/Family,...,Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?,Workplace. Topic:Does your workplace promote and practice the principles of employment equity?,Workplace. Topic:Does your workplace value the diversity of its employees?,Workplace. Topic:Would you recommend the Institute as an employer to others?,Gender. What is your Gender?,CurrentAge. Current Age,Employment Type. Employment Type,Classification. Classification,LengthofServiceOverall. Overall Length of Service at Institute (in years),LengthofServiceCurrent. Length of Service at current workplace (in years)
0,6.34133e+17,Southern Queensland Institute of TAFE,Non-Delivery (corporate),2010.0,Contract Expired,,,,,,...,Yes,Yes,Yes,Yes,Female,26 30,Temporary Full-time,Administration (AO),1-2,1-2
1,6.341337e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Retirement,-,-,-,-,-,...,Yes,Yes,Yes,Yes,,,,,,
2,6.341388e+17,Mount Isa Institute of TAFE,Delivery (teaching),2010.0,Retirement,-,-,-,-,-,...,Yes,Yes,Yes,Yes,,,,,,
3,6.341399e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,-,-,-,-,...,Yes,Yes,Yes,Yes,,,,,,
4,6.341466e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,Career Move - Private Sector,-,-,-,...,Yes,Yes,Yes,Yes,Male,41 45,Permanent Full-time,Teacher (including LVT),3-4,3-4


In [367]:
tafe_survey.isnull().sum().sort_values(ascending=False)

Main Factor. Which of these was the main factor for leaving?                                                                                                     589
InductionInfo. Topic:Did you undertake a Corporate Induction?                                                                                                    270
Contributing Factors. Ill Health                                                                                                                                 265
Contributing Factors. Maternity/Family                                                                                                                           265
Contributing Factors. Career Move - Public Sector                                                                                                                265
Contributing Factors. NONE                                                                                                                                       265
Contributi

In [368]:
tafe_survey['Main Factor. Which of these was the main factor for leaving?'].value_counts()

Dissatisfaction with %[Institute]Q25LBL%    23
Job Dissatisfaction                         22
Other                                       18
Career Move - Private Sector                16
Interpersonal Conflict                       9
Career Move - Public Sector                  8
Maternity/Family                             6
Career Move - Self-employment                4
Ill Health                                   3
Travel                                       2
Study                                        2
Name: Main Factor. Which of these was the main factor for leaving?, dtype: int64

It seems there is duplicate data: the same reasons appear both in the Main Factor. Which of these was the main factor for leaving? column and as separate columns, one per reason for leaving. We might need to combine these into a single column so we can aggregate the data and understand the reasons for leaving.

Now we will drop some of the columns that we don't need for this analysis.

In [369]:
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)

In [370]:
dete_survey_updated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 35 columns):
ID                                     822 non-null int64
SeparationType                         822 non-null object
Cease Date                             788 non-null object
DETE Start Date                        749 non-null float64
Role Start Date                        724 non-null float64
Position                               817 non-null object
Classification                         455 non-null object
Region                                 717 non-null object
Business Unit                          126 non-null object
Employment Status                      817 non-null object
Career move to public sector           822 non-null bool
Career move to private sector          822 non-null bool
Interpersonal conflicts                822 non-null bool
Job dissatisfaction                    822 non-null bool
Dissatisfaction with the department    822 non-null bool
Physical work envir

In [371]:
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66],axis=1)
tafe_survey_updated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 23 columns):
Record ID                                                                    702 non-null float64
Institute                                                                    702 non-null object
WorkArea                                                                     702 non-null object
CESSATION YEAR                                                               695 non-null float64
Reason for ceasing employment                                                701 non-null object
Contributing Factors. Career Move - Public Sector                            437 non-null object
Contributing Factors. Career Move - Private Sector                           437 non-null object
Contributing Factors. Career Move - Self-employment                          437 non-null object
Contributing Factors. Ill Health                                             437 non-null object
Contributing Factors

Now we will align and normalize the column names so it will be easier to combine the two data frames.

In [372]:
dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(' ', '_')
dete_survey_updated.head()

Unnamed: 0,id,separationtype,cease_date,dete_start_date,role_start_date,position,classification,region,business_unit,employment_status,...,work_life_balance,workload,none_of_the_above,gender,age,aboriginal,torres_strait,south_sea,disability,nesb
0,1,Ill Health Retirement,08/2012,1984.0,2004.0,Public Servant,A01-A04,Central Office,Corporate Strategy and Peformance,Permanent Full-time,...,False,False,True,Male,56-60,,,,,Yes
1,2,Voluntary Early Retirement (VER),08/2012,,,Public Servant,AO5-AO7,Central Office,Corporate Strategy and Peformance,Permanent Full-time,...,False,False,False,Male,56-60,,,,,
2,3,Voluntary Early Retirement (VER),05/2012,2011.0,2011.0,Schools Officer,,Central Office,Education Queensland,Permanent Full-time,...,False,False,True,Male,61 or older,,,,,
3,4,Resignation-Other reasons,05/2012,2005.0,2006.0,Teacher,Primary,Central Queensland,,Permanent Full-time,...,False,False,False,Female,36-40,,,,,
4,5,Age Retirement,05/2012,1970.0,1989.0,Head of Curriculum/Head of Special Education,,South East,,Permanent Full-time,...,True,False,False,Female,61 or older,,,,,


In [373]:
changed_names = {'Record ID': 'id',
                 'CESSATION YEAR' : 'cease_date',
                 'Reason for ceasing employment': 'separationtype',
                 'Gender. What is your Gender?': 'gender',
                 'CurrentAge. Current Age': 'age',
                 'Employment Type. Employment Type': 'employment_status',
                 'Classification. Classification': 'position',
                 'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
                 'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}

In [374]:
tafe_survey_updated.rename(changed_names,axis=1,inplace=True)
tafe_survey_updated.columns = tafe_survey_updated.columns.str.lower().str.strip().str.replace(' ', '_')
tafe_survey_updated.head()


Unnamed: 0,id,institute,workarea,cease_date,separationtype,contributing_factors._career_move_-_public_sector,contributing_factors._career_move_-_private_sector,contributing_factors._career_move_-_self-employment,contributing_factors._ill_health,contributing_factors._maternity/family,...,contributing_factors._study,contributing_factors._travel,contributing_factors._other,contributing_factors._none,gender,age,employment_status,position,institute_service,role_service
0,6.34133e+17,Southern Queensland Institute of TAFE,Non-Delivery (corporate),2010.0,Contract Expired,,,,,,...,,,,,Female,26 30,Temporary Full-time,Administration (AO),1-2,1-2
1,6.341337e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Retirement,-,-,-,-,-,...,-,Travel,-,-,,,,,,
2,6.341388e+17,Mount Isa Institute of TAFE,Delivery (teaching),2010.0,Retirement,-,-,-,-,-,...,-,-,-,NONE,,,,,,
3,6.341399e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,-,-,-,-,...,-,Travel,-,-,,,,,,
4,6.341466e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,Career Move - Private Sector,-,-,-,...,-,-,-,-,Male,41 45,Permanent Full-time,Teacher (including LVT),3-4,3-4


In [375]:
dete_survey_updated['separationtype'].value_counts(dropna=False)


Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separationtype, dtype: int64

In [376]:
tafe_survey_updated['separationtype'].value_counts(dropna=False)

Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
NaN                           1
Name: separationtype, dtype: int64

In [377]:
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'].notnull() & (tafe_survey_updated['separationtype'].str.contains('Resignation'))].copy()

In [378]:
tafe_resignations.shape

(340, 23)

In [379]:
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'].str.contains('Resignation')].copy()

In [380]:
dete_resignations.shape

(311, 35)

Because the TAFE separationtype column contains one null value, filtering on str.contains('Resignation') alone raised an error, so we added a check for non-null values. We now have two data frames that contain only the resignation records we need for our analysis.
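
An alternative to a separate non-null check is the na parameter of str.contains, which sets the value used for missing entries. The sketch below uses a small made-up Series rather than the survey data.

```python
import numpy as np
import pandas as pd

# Made-up values, including one missing entry like the TAFE column has.
sep = pd.Series(["Resignation", "Contract Expired", np.nan, "Resignation-Other reasons"])

# na=False treats missing values as non-matches, so the mask is purely boolean
# and can be used directly for row selection.
mask = sep.str.contains("Resignation", na=False)
resignations = sep[mask].copy()

print(resignations)
```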

In this step, we'll focus on verifying that the years in the cease_date and dete_start_date columns make sense. However, we encourage you to check the data for other issues as well!

Since the cease_date is the last year of the person's employment and the dete_start_date is the person's first year of employment, it wouldn't make sense to have years after the current date.
Given that most people in this field start working in their 20s, it's also unlikely that the dete_start_date was before the year 1940.
If we have many years higher than the current date or lower than 1940, we wouldn't want to continue with our analysis, because it could mean there's something very wrong with the data. If there are a small number of values that are unrealistically high or low, we can remove them.
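
The range check described above could be sketched as follows. The helper name out_of_range_years and the sample values are hypothetical; only the 1940 lower bound comes from the text.

```python
import pandas as pd

def out_of_range_years(years, earliest=1940, latest=None):
    """Return the non-null entries of `years` that fall outside [earliest, latest]."""
    if latest is None:
        # Default upper bound: the current calendar year.
        latest = pd.Timestamp.now().year
    years = years.dropna()
    return years[(years < earliest) | (years > latest)]

# Made-up sample with one implausible start year.
sample = pd.Series([1963.0, 2012.0, None, 1890.0, 2013.0])
suspicious = out_of_range_years(sample, latest=2014)

print(suspicious)
```

If only a handful of values come back, they can be removed; a long list would suggest a deeper problem with the column.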

In [381]:
dete_resignations['cease_date'].value_counts()

2012       126
2013        74
01/2014     22
12/2013     17
06/2013     14
09/2013     11
07/2013      9
11/2013      9
10/2013      6
08/2013      4
05/2013      2
05/2012      2
07/2012      1
2010         1
09/2010      1
07/2006      1
Name: cease_date, dtype: int64

In [382]:
pattern = r"([12][0-9]{3})"
dete_resignations['cease_date'].str.extract(pattern, expand=False).value_counts(dropna=False)


2013    146
2012    129
2014     22
NaN      11
2010      2
2006      1
Name: cease_date, dtype: int64

In [383]:
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.extract(pattern, expand=False)


In [384]:
dete_resignations['cease_date'] = dete_resignations['cease_date'].astype(float)


In [385]:
dete_resignations['dete_start_date'].value_counts(dropna=False)

NaN        28
 2011.0    24
 2008.0    22
 2007.0    21
 2012.0    21
 2010.0    17
 2005.0    15
 2004.0    14
 2009.0    13
 2006.0    13
 2013.0    10
 2000.0     9
 1999.0     8
 1998.0     6
 2002.0     6
 1994.0     6
 1996.0     6
 1992.0     6
 2003.0     6
 1980.0     5
 1990.0     5
 1993.0     5
 1997.0     5
 1989.0     4
 1995.0     4
 1988.0     4
 1991.0     4
 2001.0     3
 1986.0     3
 1985.0     3
 1976.0     2
 1983.0     2
 1974.0     2
 1963.0     1
 1972.0     1
 1984.0     1
 1975.0     1
 1973.0     1
 1987.0     1
 1982.0     1
 1971.0     1
 1977.0     1
Name: dete_start_date, dtype: int64

In [386]:
tafe_resignations['cease_date'].value_counts()

2011.0    116
2012.0     94
2010.0     68
2013.0     55
2009.0      2
Name: cease_date, dtype: int64

In [387]:
#tafe_resignations.info()
dete_resignations.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 311 entries, 3 to 821
Data columns (total 35 columns):
id                                     311 non-null int64
separationtype                         311 non-null object
cease_date                             300 non-null float64
dete_start_date                        283 non-null float64
role_start_date                        271 non-null float64
position                               308 non-null object
classification                         161 non-null object
region                                 265 non-null object
business_unit                          32 non-null object
employment_status                      307 non-null object
career_move_to_public_sector           311 non-null bool
career_move_to_private_sector          311 non-null bool
interpersonal_conflicts                311 non-null bool
job_dissatisfaction                    311 non-null bool
dissatisfaction_with_the_department    311 non-null bool
physical_work_envir

In [388]:
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']

In [389]:
tafe_resignations['contributing_factors._dissatisfaction'].value_counts()

-                                         277
Contributing Factors. Dissatisfaction      55
Name: contributing_factors._dissatisfaction, dtype: int64

In [390]:
#tafe_resignations['contributing_factors._job_dissatisfaction'].value_counts()
dete_resignations['workload'].value_counts()

False    284
True      27
Name: workload, dtype: int64

In this section we clean the data and flag dissatisfaction factors.
To do that, we create a function that returns True if a dissatisfaction
answer is present, False if it is not, and NaN if the value is missing.
We then list the columns the function should be applied to and apply it to all of them.

In [391]:

cols_tafe = ['contributing_factors._dissatisfaction', 'contributing_factors._job_dissatisfaction']

def update_vals(element):
    # Missing answers stay missing, '-' means the factor was not selected,
    # and any other value means the respondent selected it.
    if pd.isnull(element):
        return np.nan
    elif element == "-":
        return False
    else:
        return True

tafe_resignations['dissatisfied'] = tafe_resignations[cols_tafe].applymap(update_vals).any(axis=1, skipna=False)

In [392]:
tafe_resignations['dissatisfied'].value_counts(dropna=False)

False    241
True      91
NaN        8
Name: dissatisfied, dtype: int64

In [393]:


def dis_reason(element):
    
    if pd.isnull(element):
        return np.nan
    elif element in ('job_dissatisfaction','dissatisfaction_with_the_department',
            'physical_work_environment','lack_of_recognition',
            'lack_of_job_security','work_location',
            'employment_conditions','work_life_balance','workload'):
        return True
    else:
        return False

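# dis_reason receives separationtype values here; none of them equals one of the
# column names listed above, so every row is marked False.
dete_resignations['dissatisfied'] = dete_resignations['separationtype'].apply(dis_reason)
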

In [394]:
dete_resignations['separationtype'].value_counts(dropna=False)

Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Name: separationtype, dtype: int64

In [395]:
dete_resignations['dissatisfied'].value_counts(dropna=False)

False    311
Name: dissatisfied, dtype: int64
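
The all-False result is consistent with dis_reason comparing separationtype values such as 'Resignation-Other reasons' against column names, which never match. A possible alternative, sketched here on a made-up frame with only three of the flag columns, is to select the boolean dissatisfaction columns themselves and check whether any of them is True in each row.

```python
import pandas as pd

# Toy stand-in for dete_resignations (values are made up for illustration).
toy = pd.DataFrame({
    "separationtype": ["Resignation-Other reasons",
                       "Resignation-Other employer",
                       "Resignation-Other reasons"],
    "job_dissatisfaction": [True, False, False],
    "workload": [False, False, True],
    "work_life_balance": [False, False, False],
})

# Subset of the flag columns named in dis_reason; the full list would go here.
dissatisfaction_cols = ["job_dissatisfaction", "workload", "work_life_balance"]

# A row counts as dissatisfied when any of the flag columns is True.
toy["dissatisfied"] = toy[dissatisfaction_cols].any(axis=1)

print(toy["dissatisfied"].value_counts())
```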

In [397]:
dete_resignations_up = dete_resignations
tafe_resignations_up = tafe_resignations

In [398]:
dete_resignations_up['institue'] = 'DETE'
tafe_resignations_up['institue'] = 'TAFE'

In [400]:
dete_resignations_up.dropna(thresh=500, inplace=True)
tafe_resignations_up.dropna(thresh=500, inplace=True)

In [402]:
dete_resignations_up.isnull().sum()
tafe_resignations_up.isnull().sum()

id                                                     0
institute                                              0
workarea                                               0
cease_date                                             0
separationtype                                         0
contributing_factors._career_move_-_public_sector      0
contributing_factors._career_move_-_private_sector     0
contributing_factors._career_move_-_self-employment    0
contributing_factors._ill_health                       0
contributing_factors._maternity/family                 0
contributing_factors._dissatisfaction                  0
contributing_factors._job_dissatisfaction              0
contributing_factors._interpersonal_conflict           0
contributing_factors._study                            0
contributing_factors._travel                           0
contributing_factors._other                            0
contributing_factors._none                             0
gender                         

In [406]:
tafe_resignations_up.rename(columns= {'record_id': 'id','reason_for_ceasing_employment': 'separationtype', 'cessation_year': 'cease_date','currentage.currentage' : 'age', 'gender.what_is_your_gender?': 'gender' }, inplace=True)

In [410]:
tafe_resignations_up.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 25 columns):
id                                                     0 non-null float64
institute                                              0 non-null object
workarea                                               0 non-null object
cease_date                                             0 non-null float64
separationtype                                         0 non-null object
contributing_factors._career_move_-_public_sector      0 non-null object
contributing_factors._career_move_-_private_sector     0 non-null object
contributing_factors._career_move_-_self-employment    0 non-null object
contributing_factors._ill_health                       0 non-null object
contributing_factors._maternity/family                 0 non-null object
contributing_factors._dissatisfaction                  0 non-null object
contributing_factors._job_dissatisfaction              0 non-null object
contributing_factors._interpe
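
The 0 entries above follow from how dropna interprets thresh: with the default axis=0 it keeps only rows that have at least that many non-null values, and no row can have 500 non-null values in a frame with a few dozen columns. The sketch below contrasts this with axis=1, which applies the threshold to columns; the toy frame and the threshold of 3 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame: 4 rows, 3 columns, with varying amounts of missing data.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": ["21-25", "36-40", np.nan, "56-60"],
    "business_unit": [np.nan, np.nan, np.nan, "Education Queensland"],
})

# Default axis=0: keep rows with at least `thresh` non-null values.
# No row has 500 non-null values, so every row is dropped.
rows_kept = toy.dropna(thresh=500)

# axis=1: keep columns with at least `thresh` non-null values.
cols_kept = toy.dropna(thresh=3, axis=1)

print(rows_kept.shape)
print(list(cols_kept.columns))
```

A column-wise threshold such as 500 is usually more meaningful on the combined data frame, where the row count is large enough for it to apply.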

In [411]:
combined_updated = pd.merge(left=dete_resignations_up, right=tafe_resignations_up, how='left', on='id')

In [414]:
#combined_updated.info()
dete_resignations_up.shape

(0, 38)
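
Because the two surveys describe different respondents, stacking the rows is likely closer to the goal than a left merge on id, which only pairs rows whose ids match. The sketch below uses pd.concat on two made-up frames; the source column is a hypothetical label, and in the notebook the stacking would need to happen before any rows are dropped.

```python
import pandas as pd

# Made-up stand-ins for the two resignation frames.
dete_toy = pd.DataFrame({
    "id": [4, 6],
    "age": ["36-40", "41-45"],
    "source": ["DETE", "DETE"],
})
tafe_toy = pd.DataFrame({
    "id": [6.341399e17, 6.341466e17],
    "age": [None, "41 45"],
    "source": ["TAFE", "TAFE"],
})

# Stack the rows of both frames; ignore_index builds a fresh 0..n-1 index.
combined = pd.concat([dete_toy, tafe_toy], ignore_index=True)

print(combined.shape)
```

Columns present in only one frame are kept and filled with NaN for the rows of the other frame.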