# Clean and Analyze Employee Exit Surveys

This notebook follows a Dataquest guided project.

For this project, we are given two exit surveys, from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia, respectively. Our stakeholders want to know the following:

* Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
* Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

We need to combine the results of both surveys to answer these questions. Although both used the same survey template, one of them customized some of the answers, so bringing the two datasets together requires some data cleaning first.

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from pathlib import Path

In [2]:
data_path = Path.home() / "datasets" / "tabular_practice"

dete_survey = pd.read_csv(data_path / "dete_survey.csv")
tafe_survey = pd.read_csv(data_path / "tafe_survey.csv")

In [3]:
dete_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   ID                                   822 non-null    int64 
 1   SeparationType                       822 non-null    object
 2   Cease Date                           822 non-null    object
 3   DETE Start Date                      822 non-null    object
 4   Role Start Date                      822 non-null    object
 5   Position                             817 non-null    object
 6   Classification                       455 non-null    object
 7   Region                               822 non-null    object
 8   Business Unit                        126 non-null    object
 9   Employment Status                    817 non-null    object
 10  Career move to public sector         822 non-null    bool  
 11  Career move to private sector        822 non-

In [4]:
dete_survey.describe(include=["O"])

Unnamed: 0,SeparationType,Cease Date,DETE Start Date,Role Start Date,Position,Classification,Region,Business Unit,Employment Status,Professional Development,...,Kept informed,Wellness programs,Health & Safety,Gender,Age,Aboriginal,Torres Strait,South Sea,Disability,NESB
count,822,822,822,822,817,455,822,126,817,808,...,813,766,793,798,811,16,3,7,23,32
unique,9,25,51,46,15,8,9,14,5,6,...,6,6,6,2,10,1,1,1,1,1
top,Age Retirement,2012,Not Stated,Not Stated,Teacher,Primary,Metropolitan,Education Queensland,Permanent Full-time,A,...,A,A,A,Female,61 or older,Yes,Yes,Yes,Yes,Yes
freq,285,344,73,98,324,161,135,54,434,413,...,401,253,386,573,222,16,3,7,23,32


In [5]:
dete_survey["Cease Date"].value_counts()

Cease Date
2012          344
2013          200
01/2014        43
12/2013        40
Not Stated     34
09/2013        34
06/2013        27
07/2013        22
10/2013        20
11/2013        16
08/2013        12
05/2013         7
05/2012         6
04/2014         2
07/2014         2
08/2012         2
04/2013         2
02/2014         2
11/2012         1
09/2010         1
2010            1
2014            1
07/2012         1
09/2014         1
07/2006         1
Name: count, dtype: int64

In [6]:
dete_survey["Professional Development"].value_counts(dropna=False)

Professional Development
A      413
SA     184
N      103
D       60
SD      33
M       15
NaN     14
Name: count, dtype: int64

In [7]:
tafe_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 72 columns):
 #   Column                                                                                                                                                         Non-Null Count  Dtype  
---  ------                                                                                                                                                         --------------  -----  
 0   Record ID                                                                                                                                                      702 non-null    float64
 1   Institute                                                                                                                                                      702 non-null    object 
 2   WorkArea                                                                                                                                  

In [45]:
tafe_survey.iloc[:, 31].value_counts()

WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit
Agree                234
Strongly Agree       156
Disagree              87
Neutral               81
Strongly Disagree     43
Not Applicable         8
Name: count, dtype: int64

The first question to be addressed is: "Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?".

We need the following information from the data:

* Which employees resigned?
* Duration of employment of employees (to be grouped into "short period" versus "been there longer")
* Reasons for resignation

In [9]:
dete_survey["SeparationType"].value_counts()

SeparationType
Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: count, dtype: int64

In [10]:
tafe_survey.columns

Index(['Record ID', 'Institute', 'WorkArea', 'CESSATION YEAR',
       'Reason for ceasing employment',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE',
       'Main Factor. Which of these was the main factor for leaving?',
       'InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction',
       'InstituteViews. Topic:2. I was given access to skills training to help me do my job better',
       'InstituteViews. Topic:3. I was given adequate oppo

In [11]:
tafe_survey["Reason for ceasing employment"].value_counts()

Reason for ceasing employment
Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
Name: count, dtype: int64

We first filter for employees who resigned:

* For `dete_survey`: Column "SeparationType", value starting with "Resignation"
* For `tafe_survey`: Column "Reason for ceasing employment", value "Resignation"

In [12]:
dete_resigned = dete_survey[dete_survey["SeparationType"].str.startswith("Resignation")].copy()
tafe_resigned = tafe_survey[tafe_survey["Reason for ceasing employment"] == "Resignation"].copy()

dete_resigned.shape, tafe_resigned.shape

((311, 56), (340, 72))

In [13]:
tafe_resigned["LengthofServiceOverall. Overall Length of Service at Institute (in years)"].value_counts(dropna=False)

LengthofServiceOverall. Overall Length of Service at Institute (in years)
Less than 1 year      73
1-2                   64
3-4                   63
NaN                   50
5-6                   33
11-20                 26
7-10                  21
More than 20 years    10
Name: count, dtype: int64

In [14]:
dete_resigned["Cease Date"].value_counts()

Cease Date
2012          126
2013           74
01/2014        22
12/2013        17
06/2013        14
Not Stated     11
09/2013        11
07/2013         9
11/2013         9
10/2013         6
08/2013         4
05/2012         2
05/2013         2
09/2010         1
07/2012         1
2010            1
07/2006         1
Name: count, dtype: int64

In [15]:
dete_resigned["DETE Start Date"].value_counts()

DETE Start Date
Not Stated    28
2011          24
2008          22
2012          21
2007          21
2010          17
2005          15
2004          14
2006          13
2009          13
2013          10
2000           9
1999           8
1994           6
2003           6
1992           6
1996           6
1998           6
2002           6
1990           5
1997           5
1980           5
1993           5
1989           4
1991           4
1988           4
1995           4
1986           3
2001           3
1985           3
1983           2
1976           2
1974           2
1975           1
1984           1
1971           1
1982           1
1972           1
1963           1
1977           1
1973           1
1987           1
Name: count, dtype: int64

Next, we'd like to determine how long each employee stayed at the respective institute:

* `tafe_resigned`: Column "LengthofServiceOverall...". This variable is already grouped into bins.
* `dete_resigned`: Columns "Cease Date" and "DETE Start Date". Both use the string "Not Stated" for missing values, and "Cease Date" sometimes also includes the month (e.g. "01/2014").

We'll create a new column in `dete_resigned` with the same bins as in the `tafe_resigned` column.
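
The two cleaning steps for the DETE dates can be tried out on a few rows first. The following is a minimal sketch on made-up values (the `toy` frame and its entries are assumptions, only the column names, the "Not Stated" marker and the bin edges come from this notebook). It extracts the year, takes the difference, and bins it into right-closed intervals.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values in the same formats as the DETE columns).
toy = pd.DataFrame({
    "Cease Date": ["2012", "01/2014", "Not Stated", "09/2013"],
    "DETE Start Date": ["2012", "2003", "2005", "Not Stated"],
})

# Turn "Not Stated" into NaN, then keep the last piece after splitting on "/".
cease = toy["Cease Date"].mask(toy["Cease Date"] == "Not Stated")
start = toy["DETE Start Date"].mask(toy["DETE Start Date"] == "Not Stated")
cease_year = cease.str.split("/").str[-1].astype(float)
start_year = start.astype(float)
years = cease_year - start_year

# pd.cut uses right-closed intervals by default:
# (-1, 0] -> "Less than 1 year", (0, 2] -> "1-2", ..., (20, inf] -> "More than 20 years"
bins = [-1, 0, 2, 4, 6, 10, 20, np.inf]
labels = ["Less than 1 year", "1-2", "3-4", "5-6", "7-10", "11-20", "More than 20 years"]
binned = pd.cut(years, bins=bins, labels=labels)
print(pd.DataFrame({"years": years, "bin": binned}))
```

Because only years are compared, a difference of 0 covers anything from a few days to almost two calendar years of service, so the "Less than 1 year" label is an approximation for the DETE data.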

In [16]:
tafe_resigned.rename({"LengthofServiceOverall. Overall Length of Service at Institute (in years)": "duration_employment"}, axis=1, inplace=True)
tafe_resigned.columns

Index(['Record ID', 'Institute', 'WorkArea', 'CESSATION YEAR',
       'Reason for ceasing employment',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE',
       'Main Factor. Which of these was the main factor for leaving?',
       'InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction',
       'InstituteViews. Topic:2. I was given access to skills training to help me do my job better',
       'InstituteViews. Topic:3. I was given adequate oppo

In [17]:
tafe_resigned.loc[tafe_resigned["duration_employment"] == "Not Stated", "duration_employment"] = np.nan
tafe_resigned["duration_employment"].value_counts(dropna=False)

duration_employment
Less than 1 year      73
1-2                   64
3-4                   63
NaN                   50
5-6                   33
11-20                 26
7-10                  21
More than 20 years    10
Name: count, dtype: int64

In [18]:
dete_resigned.loc[dete_resigned["Cease Date"] == "Not Stated", "Cease Date"] = np.nan
dete_resigned["Cease Date"].value_counts(dropna=False)

Cease Date
2012       126
2013        74
01/2014     22
12/2013     17
06/2013     14
NaN         11
09/2013     11
07/2013      9
11/2013      9
10/2013      6
08/2013      4
05/2012      2
05/2013      2
09/2010      1
07/2012      1
2010         1
07/2006      1
Name: count, dtype: int64

In [19]:
dete_resigned["cease_date_year"] = dete_resigned["Cease Date"].str.split("/").str[-1].astype(float)

In [20]:
dete_resigned["cease_date_year"].value_counts(dropna=False)

cease_date_year
2013.0    146
2012.0    129
2014.0     22
NaN        11
2010.0      2
2006.0      1
Name: count, dtype: int64

In [21]:
dete_resigned.loc[dete_resigned["DETE Start Date"] == "Not Stated", "DETE Start Date"] = np.nan
dete_resigned["DETE Start Date"].value_counts(dropna=False)

DETE Start Date
NaN     28
2011    24
2008    22
2012    21
2007    21
2010    17
2005    15
2004    14
2006    13
2009    13
2013    10
2000     9
1999     8
1994     6
2003     6
1992     6
1996     6
1998     6
2002     6
1990     5
1997     5
1980     5
1993     5
1989     4
1991     4
1988     4
1995     4
1986     3
2001     3
1985     3
1983     2
1976     2
1974     2
1975     1
1984     1
1971     1
1982     1
1972     1
1963     1
1977     1
1973     1
1987     1
Name: count, dtype: int64

In [22]:
dete_resigned["start_date_year"] = dete_resigned["DETE Start Date"].astype(float)

In [23]:
dete_resigned["years_employed"] = dete_resigned["cease_date_year"] - dete_resigned["start_date_year"]
dete_resigned["years_employed"].value_counts(dropna=False)

years_employed
NaN     38
5.0     23
1.0     22
3.0     20
0.0     20
6.0     17
4.0     16
9.0     14
2.0     14
7.0     13
13.0     8
8.0      8
20.0     7
15.0     7
17.0     6
10.0     6
22.0     6
14.0     6
12.0     6
18.0     5
16.0     5
24.0     4
11.0     4
23.0     4
21.0     3
19.0     3
39.0     3
32.0     3
25.0     2
28.0     2
26.0     2
36.0     2
30.0     2
34.0     1
29.0     1
27.0     1
42.0     1
33.0     1
41.0     1
49.0     1
35.0     1
38.0     1
31.0     1
Name: count, dtype: int64

In [24]:
tafe_resigned["duration_employment"].value_counts(dropna=False)

duration_employment
Less than 1 year      73
1-2                   64
3-4                   63
NaN                   50
5-6                   33
11-20                 26
7-10                  21
More than 20 years    10
Name: count, dtype: int64

In [25]:
bins=[-1, 0, 2, 4, 6, 10, 20, np.inf]
labels=["Less than 1 year", "1-2", "3-4", "5-6", "7-10", "11-20", "More than 20 years"]
dete_resigned["duration_employment"] = pd.cut(dete_resigned["years_employed"], bins=bins, labels=labels)
dete_resigned["duration_employment"].value_counts(dropna=False)

duration_employment
11-20                 57
More than 20 years    43
7-10                  41
5-6                   40
NaN                   38
1-2                   36
3-4                   36
Less than 1 year      20
Name: count, dtype: int64

We found an interesting difference. Among employees with a known duration, the largest share of DETE resignees worked there 11 years or more (100 of 273), while the largest share of TAFE resignees worked there 2 years or less (137 of 290). In fact, the ordering of frequencies across the bins is almost inverted.
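
To compare the two distributions on the same scale, the counts printed above can be converted into shares of the employees with a known duration. This is a small sketch that simply retypes those counts by hand. The names `order`, `dete_counts`, `tafe_counts` and `shares` are introduced here for illustration only.

```python
import pandas as pd

order = ["Less than 1 year", "1-2", "3-4", "5-6", "7-10", "11-20", "More than 20 years"]

# Counts copied from the value_counts() outputs above (NaN rows excluded).
dete_counts = pd.Series([20, 36, 36, 40, 41, 57, 43], index=order)
tafe_counts = pd.Series([73, 64, 63, 33, 21, 26, 10], index=order)

# Share of each bin among employees whose duration is known.
shares = pd.DataFrame({
    "DETE": dete_counts / dete_counts.sum(),
    "TAFE": tafe_counts / tafe_counts.sum(),
}).round(3)
print(shares)

# Aggregate shares for the shortest and longest tenures.
short_bins = ["Less than 1 year", "1-2"]
long_bins = ["11-20", "More than 20 years"]
print(shares.loc[short_bins].sum())
print(shares.loc[long_bins].sum())
```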

Next, we need to look at columns detailing the reason for resignation:

* `dete_resigned`: Boolean columns at positions 10 to 27
* `tafe_resigned`: Columns starting with "Contributing Factors." and "Main Factor."

Let us see whether these match, and how they are encoded.

In [26]:
dete_resigned.columns[10:28]

Index(['Career move to public sector', 'Career move to private sector',
       'Interpersonal conflicts', 'Job dissatisfaction',
       'Dissatisfaction with the department', 'Physical work environment',
       'Lack of recognition', 'Lack of job security', 'Work location',
       'Employment conditions', 'Maternity/family', 'Relocation',
       'Study/Travel', 'Ill Health', 'Traumatic incident', 'Work life balance',
       'Workload', 'None of the above'],
      dtype='object')

In [27]:
tafe_resigned.columns[tafe_resigned.columns.str.startswith("Contributing Factors.")]

Index(['Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE'],
      dtype='object')

In [28]:
tafe_resigned.iloc[:, 5].value_counts(dropna=False)

Contributing Factors. Career Move - Public Sector 
-                              284
Career Move - Public Sector     48
NaN                              8
Name: count, dtype: int64

In [29]:
tafe_resigned.iloc[:, 6].value_counts(dropna=False)

Contributing Factors. Career Move - Private Sector 
-                               233
Career Move - Private Sector     99
NaN                               8
Name: count, dtype: int64

The encoding of these columns in `tafe_resigned` is odd: "-" appears to mean `False`, and any other string (the name of the factor) means `True`. Let us convert these values to booleans, as in `dete_resigned`, while keeping missing values as `NaN`.
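
One possible way to express this conversion, shown on a toy frame, is to compare against "-" and then restore the missing values. This is a sketch of an alternative formulation, not the code used in the cells below. The frame `toy` and its column names are made up.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the TAFE encoding: "-" = not selected, label = selected.
toy = pd.DataFrame({
    "factor_a": ["-", "Factor A", np.nan],
    "factor_b": ["Factor B", "-", "-"],
})

# `ne("-")` is True for every value other than "-", including NaN,
# so `where(toy.notna())` puts NaN back wherever the answer was missing.
as_bool = toy.ne("-").where(toy.notna())
print(as_bool)
```

A column that contains missing values ends up with `object` dtype, because it mixes booleans and `NaN`.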

In [30]:
slice = tafe_resigned.iloc[:, 5:17].copy()
slice.columns

Index(['Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE'],
      dtype='object')

In [37]:
# Map non-missing values to `True`, then all "-" values to `False`
slice_new = slice.where(slice.isnull(), True).where(slice != "-", False)

In [38]:
slice_new.iloc[:, 0].value_counts(dropna=False)

Contributing Factors. Career Move - Public Sector 
False    284
True      48
NaN        8
Name: count, dtype: int64

In [39]:
tafe_resigned.iloc[:, 5:17] = slice_new
tafe_resigned.iloc[:, 5:17].head()

Unnamed: 0,Contributing Factors. Career Move - Public Sector,Contributing Factors. Career Move - Private Sector,Contributing Factors. Career Move - Self-employment,Contributing Factors. Ill Health,Contributing Factors. Maternity/Family,Contributing Factors. Dissatisfaction,Contributing Factors. Job Dissatisfaction,Contributing Factors. Interpersonal Conflict,Contributing Factors. Study,Contributing Factors. Travel,Contributing Factors. Other,Contributing Factors. NONE
3,False,False,False,False,False,False,False,False,False,True,False,False
4,False,True,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,True,False
6,False,True,False,False,True,False,False,False,False,False,True,False
7,False,False,False,False,False,False,False,False,False,False,True,False


In [42]:
# These are factors we find in both surveys:
common_names = [
    "career_move_public_sector",
    "career_move_private_sector",
    "ill_health",
    "interpersonal_conflict",
    "job_dissatisfaction",
    "maternity_family",
    "study_travel",
]

# Let us relabel these column names
dete_names = [
    "Career move to public sector",
    "Career move to private sector",
    "Ill Health",
    "Interpersonal conflicts",
    "Job dissatisfaction",
    "Maternity/family",
    "Study/Travel",    
]
dete_map = dict(zip(dete_names, common_names))
dete_resigned.rename(dete_map, axis=1, inplace=True)

tafe_names = [
    'Contributing Factors. Career Move - Public Sector ',
    'Contributing Factors. Career Move - Private Sector ',
    'Contributing Factors. Ill Health',
    'Contributing Factors. Interpersonal Conflict',
    'Contributing Factors. Job Dissatisfaction',
    'Contributing Factors. Maternity/Family',
]
tafe_map = dict(zip(tafe_names, common_names[:-1]))
tafe_resigned.rename(tafe_map, axis=1, inplace=True)
tafe_resigned[common_names[-1]] = tafe_resigned['Contributing Factors. Study'] | tafe_resigned['Contributing Factors. Travel']

In [43]:
dete_resigned.columns

Index(['ID', 'SeparationType', 'Cease Date', 'DETE Start Date',
       'Role Start Date', 'Position', 'Classification', 'Region',
       'Business Unit', 'Employment Status', 'career_move_public_sector',
       'career_move_private_sector', 'interpersonal_conflict',
       'job_dissatisfaction', 'Dissatisfaction with the department',
       'Physical work environment', 'Lack of recognition',
       'Lack of job security', 'Work location', 'Employment conditions',
       'maternity_family', 'Relocation', 'study_travel', 'ill_health',
       'Traumatic incident', 'Work life balance', 'Workload',
       'None of the above', 'Professional Development',
       'Opportunities for promotion', 'Staff morale', 'Workplace issue',
       'Physical environment', 'Worklife balance',
       'Stress and pressure support', 'Performance of supervisor',
       'Peer support', 'Initiative', 'Skills', 'Coach', 'Career Aspirations',
       'Feedback', 'Further PD', 'Communication', 'My say', 'Information',

In [44]:
tafe_resigned.columns

Index(['Record ID', 'Institute', 'WorkArea', 'CESSATION YEAR',
       'Reason for ceasing employment', 'career_move_public_sector',
       'career_move_private_sector',
       'Contributing Factors. Career Move - Self-employment', 'ill_health',
       'maternity_family', 'Contributing Factors. Dissatisfaction',
       'job_dissatisfaction', 'interpersonal_conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE',
       'Main Factor. Which of these was the main factor for leaving?',
       'InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction',
       'InstituteViews. Topic:2. I was given access to skills training to help me do my job better',
       'InstituteViews. Topic:3. I was given adequate opportunities for personal development',
       'InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%',
       'InstituteVi

Some reasons are listed in `dete_resigned` but have no direct counterpart in `tafe_resigned`, and a few TAFE factors are likewise unmatched.

In [46]:
dete_resigned.columns[10:28]

Index(['career_move_public_sector', 'career_move_private_sector',
       'interpersonal_conflict', 'job_dissatisfaction',
       'Dissatisfaction with the department', 'Physical work environment',
       'Lack of recognition', 'Lack of job security', 'Work location',
       'Employment conditions', 'maternity_family', 'Relocation',
       'study_travel', 'ill_health', 'Traumatic incident', 'Work life balance',
       'Workload', 'None of the above'],
      dtype='object')

In [47]:
common_names

['career_move_public_sector',
 'career_move_private_sector',
 'ill_health',
 'interpersonal_conflict',
 'job_dissatisfaction',
 'maternity_family',
 'study_travel']

In [48]:
dete_resigned.columns[[14, 15, 16, 17, 18, 19, 21, 24, 25, 26, 27]]

Index(['Dissatisfaction with the department', 'Physical work environment',
       'Lack of recognition', 'Lack of job security', 'Work location',
       'Employment conditions', 'Relocation', 'Traumatic incident',
       'Work life balance', 'Workload', 'None of the above'],
      dtype='object')

In [49]:
tafe_resigned.columns[[7, 10, 16]]

Index(['Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Dissatisfaction', 'Contributing Factors. NONE'],
      dtype='object')

On the other hand, the TAFE survey contains a number of questions that are closely related to the additional reasons in the DETE survey.

In [56]:
list(zip(range(18, 48), tafe_resigned.columns[18:48]))

[(18,
  'InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction'),
 (19,
  'InstituteViews. Topic:2. I was given access to skills training to help me do my job better'),
 (20,
  'InstituteViews. Topic:3. I was given adequate opportunities for personal development'),
 (21,
  'InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%'),
 (22,
  'InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had'),
 (23,
  'InstituteViews. Topic:6. The organisation recognised when staff did good work'),
 (24, 'InstituteViews. Topic:7. Management was generally supportive of me'),
 (25,
  'InstituteViews. Topic:8. Management was generally supportive of my team'),
 (26,
  'InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me'),
 (27,
  'InstituteViews. Topic:10. Staff morale was positive within the Institute'),
 (28,
  'InstituteViews. Topic:

In [53]:
tafe_resigned.iloc[:, 18].value_counts(dropna=False)

InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction
Agree                109
Neutral               76
Strongly Agree        48
NaN                   44
Disagree              34
Strongly Disagree     26
Not Applicable         3
Name: count, dtype: int64

Suppose we can pair a DETE reason for resigning with one or more questions in the TAFE survey, such that the reason counts as `True` for a TAFE employee if at least one of those questions was answered "Disagree" or "Strongly Disagree". In that case, we add the reason as a new column to the TAFE data and rename the matching column in the DETE data.
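
The pairing rule can be illustrated on a toy frame with two Likert-style questions. This is a minimal sketch. The column names `q1` and `q2` are hypothetical, and passing `na=False` (treating a missing answer as "did not disagree") is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Toy answers to two paired questions (hypothetical column names).
likert = pd.DataFrame({
    "q1": ["Agree", "Strongly Disagree", np.nan, "Neutral"],
    "q2": ["Disagree", "Neutral", np.nan, "Strongly Agree"],
})

# "Disagree" is a substring of both "Disagree" and "Strongly Disagree".
disagree = likert.apply(lambda s: s.str.contains("Disagree", na=False))

# The derived reason is True if any paired question was disagreed with.
reason = disagree.any(axis=1)
print(reason.tolist())
```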

In [54]:
tafe_resigned.iloc[:, 18].str.contains("Disagree").value_counts(dropna=False)

InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction
False    236
True      60
NaN       44
Name: count, dtype: int64

In [65]:
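def create_tafe_reason_resigning(dete_name, tafe_ilocs, new_name):
    """Add reason `new_name` to TAFE (from Likert answers) and rename `dete_name` to it in DETE."""
    # True where an answer is "Disagree" or "Strongly Disagree"
    tafe_disagree = tafe_resigned.iloc[:, tafe_ilocs].apply(lambda s: s.str.contains("Disagree"))
    # The reason applies if at least one of the paired questions was disagreed with
    tafe_resigned[new_name] = tafe_disagree.any(axis=1)
    dete_resigned.rename({dete_name: new_name}, axis=1, inplace=True)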

In [66]:
extra_tafe_reasons = [
    (
        "Physical work environment",
        [44],
        "work_environment",
    ),
    (
        "Lack of recognition",
        [23, 43],
        "lack_recognition",
    ),
    (
        "Work life balance",
        [41, 42],
        "work_life_balance",
    ),
]

for dete_name, tafe_ilocs, new_name in extra_tafe_reasons:
    create_tafe_reason_resigning(dete_name, tafe_ilocs, new_name)

common_names = common_names + [x[-1] for x in extra_tafe_reasons]

In [67]:
common_names

['career_move_public_sector',
 'career_move_private_sector',
 'ill_health',
 'interpersonal_conflict',
 'job_dissatisfaction',
 'maternity_family',
 'study_travel',
 'work_environment',
 'lack_recognition',
 'work_life_balance']

A first coarse answer to the question can be obtained by labeling each employee (who resigned) as dissatisfied or not.
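
As a rough illustration of this labeling, the sketch below flags a row as dissatisfied when any of a chosen set of reason columns is `True`, then summarizes the flag per duration group. The rows are made up. Only the column names are borrowed from this notebook.

```python
import pandas as pd

# Toy rows (hypothetical values).
toy = pd.DataFrame({
    "duration_employment": ["1-2", "1-2", "7-10", "7-10"],
    "job_dissatisfaction": [False, True, True, False],
    "work_life_balance": [False, False, False, True],
})

# An employee counts as dissatisfied if at least one reason applies.
reason_cols = ["job_dissatisfaction", "work_life_balance"]
toy["dissatisfied"] = toy[reason_cols].any(axis=1)

# The mean of a boolean column is the fraction of True values.
summary = toy.groupby("duration_employment")["dissatisfied"].agg(["mean", "size"])
print(summary)
```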

In [70]:
common_dissatisfied_names = ['job_dissatisfaction', 'work_environment', 'lack_recognition', 'work_life_balance']
dete_dissatisfied_names = common_dissatisfied_names + [
    'Lack of job security', 'Work location', 'Employment conditions', 'Workload',
]
tafe_dissatisfied_names = common_dissatisfied_names + ['Contributing Factors. Dissatisfaction']

dete_resigned["dissatisfied"] = dete_resigned.loc[:, dete_dissatisfied_names].any(axis=1)
tafe_resigned["dissatisfied"] = tafe_resigned.loc[:, tafe_dissatisfied_names].any(axis=1)

In [71]:
dete_resigned["dissatisfied"].value_counts(dropna=False)

dissatisfied
False    167
True     144
Name: count, dtype: int64

In [72]:
tafe_resigned["dissatisfied"].value_counts(dropna=False)

dissatisfied
False    178
True     162
Name: count, dtype: int64

In [118]:
def fraction_pos(x):
    return np.round(np.mean(x), 3)

pd.pivot_table(tafe_resigned, "dissatisfied", "duration_employment", aggfunc=[fraction_pos, len], margins=True).sort_values(("fraction_pos", "dissatisfied"))

Unnamed: 0_level_0,fraction_pos,len
Unnamed: 0_level_1,dissatisfied,dissatisfied
duration_employment,Unnamed: 1_level_2,Unnamed: 2_level_2
1-2,0.438,64
Less than 1 year,0.466,73
3-4,0.492,63
All,0.503,290
5-6,0.515,33
11-20,0.538,26
7-10,0.667,21
More than 20 years,0.8,10


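For TAFE, more employees resign after a short tenure than after a long one. However, short-tenure resignees are also less often dissatisfied: the dissatisfied fraction is below 0.5 for tenures of up to 4 years, and reaches 0.667 and 0.8 for the "7-10" and "More than 20 years" groups, although those two groups are small (21 and 10 employees).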

In [82]:
pd.pivot_table(dete_resigned, "dissatisfied", "duration_employment", aggfunc=[fraction_pos, len], margins=True).sort_values(("fraction_pos", "dissatisfied"))

Unnamed: 0_level_0,fraction_pos,len
Unnamed: 0_level_1,dissatisfied,dissatisfied
duration_employment,Unnamed: 1_level_2,Unnamed: 2_level_2
1-2,0.278,36
3-4,0.333,36
All,0.484,273
11-20,0.491,57
Less than 1 year,0.5,20
5-6,0.55,40
More than 20 years,0.581,43
7-10,0.61,41


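For DETE, the overall fraction of dissatisfied resignees is slightly lower (0.484 versus 0.503 for TAFE), and employees tend to stay longer before resigning. Once more, longer-serving employees tend to be dissatisfied more often when resigning: the fraction is 0.278 and 0.333 for the "1-2" and "3-4" groups, but between 0.491 and 0.61 for every group from "5-6" upward. The "Less than 1 year" group (0.5, only 20 employees) is an exception to this trend.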

Finally, let us combine the two datasets, using only the columns that are shared and useful for the analysis.
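
The combination step itself could look roughly like the following sketch, which stacks two toy frames on the columns they share. The rows are made up, and using `columns.intersection` with `pd.concat` is an assumption about how one might do it, not necessarily the approach taken in this notebook.

```python
import pandas as pd

# Toy frames with some shared and some survey-specific columns (hypothetical rows).
dete_toy = pd.DataFrame({
    "age": ["41-45"], "dissatisfied": [True], "institute": ["DETE"], "Region": ["Metropolitan"],
})
tafe_toy = pd.DataFrame({
    "age": ["21-25"], "dissatisfied": [False], "institute": ["TAFE"], "WorkArea": ["Teaching"],
})

# Keep only the columns present in both frames, then stack the rows.
shared = dete_toy.columns.intersection(tafe_toy.columns)
combined = pd.concat([dete_toy[shared], tafe_toy[shared]], ignore_index=True)
print(combined)
```

The `institute` column keeps track of which survey each row came from after the frames are stacked.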

In [83]:
dete_resigned.columns

Index(['ID', 'SeparationType', 'Cease Date', 'DETE Start Date',
       'Role Start Date', 'Position', 'Classification', 'Region',
       'Business Unit', 'Employment Status', 'career_move_public_sector',
       'career_move_private_sector', 'interpersonal_conflict',
       'job_dissatisfaction', 'Dissatisfaction with the department',
       'work_environment', 'lack_recognition', 'Lack of job security',
       'Work location', 'Employment conditions', 'maternity_family',
       'Relocation', 'study_travel', 'ill_health', 'Traumatic incident',
       'work_life_balance', 'Workload', 'None of the above',
       'Professional Development', 'Opportunities for promotion',
       'Staff morale', 'Workplace issue', 'Physical environment',
       'Worklife balance', 'Stress and pressure support',
       'Performance of supervisor', 'Peer support', 'Initiative', 'Skills',
       'Coach', 'Career Aspirations', 'Feedback', 'Further PD',
       'Communication', 'My say', 'Information', 'Kept infor

In [84]:
dete_resigned[['Gender', 'Age']].describe(include="all")

Unnamed: 0,Gender,Age
count,302,306
unique,2,10
top,Female,41-45
freq,233,48


In [85]:
dete_resigned['Age'].value_counts()

Age
41-45            48
46-50            42
36-40            41
26-30            35
51-55            32
31-35            29
21-25            29
56-60            26
61 or older      23
20 or younger     1
Name: count, dtype: int64

In [86]:
tafe_resigned.columns

Index(['Record ID', 'Institute', 'WorkArea', 'CESSATION YEAR',
       'Reason for ceasing employment', 'career_move_public_sector',
       'career_move_private_sector',
       'Contributing Factors. Career Move - Self-employment', 'ill_health',
       'maternity_family', 'Contributing Factors. Dissatisfaction',
       'job_dissatisfaction', 'interpersonal_conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE',
       'Main Factor. Which of these was the main factor for leaving?',
       'InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction',
       'InstituteViews. Topic:2. I was given access to skills training to help me do my job better',
       'InstituteViews. Topic:3. I was given adequate opportunities for personal development',
       'InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%',
       'InstituteVi

In [87]:
tafe_resigned[['Gender. What is your Gender?', 'CurrentAge. Current Age']].describe(include="all")

Unnamed: 0,Gender. What is your Gender?,CurrentAge. Current Age
count,290,290
unique,2,9
top,Female,41 45
freq,191,45


In [88]:
tafe_resigned['CurrentAge. Current Age'].value_counts()

CurrentAge. Current Age
41  45           45
46  50           39
51-55            39
21  25           33
36  40           32
31  35           32
26  30           32
56 or older      29
20 or younger     9
Name: count, dtype: int64

In [89]:
dete_resigned['Age'].value_counts()

Age
41-45            48
46-50            42
36-40            41
26-30            35
51-55            32
31-35            29
21-25            29
56-60            26
61 or older      23
20 or younger     1
Name: count, dtype: int64

In [90]:
# TAFE separates the age bounds with two spaces; use a hyphen as in the DETE data
tafe_resigned['age'] = tafe_resigned['CurrentAge. Current Age'].str.replace("  ", "-", regex=False)
tafe_resigned['age'].value_counts(dropna=False)

age
NaN              50
41-45            45
46-50            39
51-55            39
21-25            33
36-40            32
31-35            32
26-30            32
56 or older      29
20 or younger     9
Name: count, dtype: int64

In [91]:
tafe_resigned['gender'] = tafe_resigned['Gender. What is your Gender?']

In [107]:
dete_map = {"Age": "age", "Gender": "gender"}
dete_resigned.rename(dete_map, axis=1, inplace=True)

# Merge the two oldest DETE groups to match the TAFE grouping ("56 or older")
ind = dete_resigned["age"].isin(["56-60", "61 or older"])
dete_resigned.loc[ind, "age"] = "56 or older"
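
After harmonizing the labels, it can be worth checking that both surveys use the same set of age groups. The following is a minimal sketch on made-up toy data (the series `dete_age` and `tafe_age` are hypothetical stand-ins, not the notebook's frames), assuming the same two cleaning steps as above.

```python
import pandas as pd

# Hypothetical toy columns standing in for the two surveys
dete_age = pd.Series(["41-45", "56-60", "61 or older", None])
tafe_age = pd.Series(["41  45", "56 or older", "20 or younger"])

# Same cleaning steps as in the notebook cells
tafe_clean = tafe_age.str.replace("  ", "-", regex=False)
dete_clean = dete_age.replace({"56-60": "56 or older", "61 or older": "56 or older"})

# Labels that appear in only one of the two surveys
unmatched = set(dete_clean.dropna()) ^ set(tafe_clean.dropna())
print(unmatched)
```

An empty set would mean the labels line up. In this toy example one label exists only on the TAFE side, so it is reported.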

In [108]:
dete_names = dete_resigned.columns
ind = dete_names == dete_names.str.lower()
dete_names[ind]

Index(['career_move_public_sector', 'career_move_private_sector',
       'interpersonal_conflict', 'job_dissatisfaction', 'work_environment',
       'lack_recognition', 'maternity_family', 'study_travel', 'ill_health',
       'work_life_balance', 'gender', 'age', 'cease_date_year',
       'start_date_year', 'years_employed', 'duration_employment',
       'dissatisfied'],
      dtype='object')

In [109]:
dete_res_selected = dete_resigned[dete_names[ind]].copy()
dete_res_selected["institute"] = "DETE"

In [110]:
dete_res_selected.head()

Unnamed: 0,career_move_public_sector,career_move_private_sector,interpersonal_conflict,job_dissatisfaction,work_environment,lack_recognition,maternity_family,study_travel,ill_health,work_life_balance,gender,age,cease_date_year,start_date_year,years_employed,duration_employment,dissatisfied,institute
3,False,True,False,False,False,False,False,False,False,False,Female,36-40,2012.0,2005.0,7.0,7-10,False,DETE
5,False,True,False,False,False,False,True,False,False,False,Female,41-45,2012.0,1994.0,18.0,11-20,True,DETE
8,False,True,False,False,False,False,False,False,False,False,Female,31-35,2012.0,2009.0,3.0,3-4,False,DETE
9,False,False,True,True,False,False,False,False,False,False,Female,46-50,2012.0,1997.0,15.0,11-20,True,DETE
11,False,False,False,False,False,False,True,False,False,False,Male,31-35,2012.0,2009.0,3.0,3-4,False,DETE


In [111]:
tafe_names = tafe_resigned.columns
ind = tafe_names == tafe_names.str.lower()
tafe_names[ind]

Index(['career_move_public_sector', 'career_move_private_sector', 'ill_health',
       'maternity_family', 'job_dissatisfaction', 'interpersonal_conflict',
       'duration_employment', 'study_travel', 'work_environment',
       'lack_recognition', 'work_life_balance', 'dissatisfied', 'age',
       'gender'],
      dtype='object')

In [112]:
tafe_res_selected = tafe_resigned[tafe_names[ind]].copy()
tafe_res_selected["institute"] = "TAFE"

In [103]:
tafe_res_selected.head()

Unnamed: 0,career_move_public_sector,career_move_private_sector,ill_health,maternity_family,job_dissatisfaction,interpersonal_conflict,duration_employment,study_travel,work_environment,lack_recognition,work_life_balance,dissatisfied,age,gender,institute
3,False,False,False,False,False,False,,True,False,False,False,False,,,TAFE
4,False,True,False,False,False,False,3-4,False,False,False,False,False,41-45,Male,TAFE
5,False,False,False,False,False,False,7-10,False,True,False,True,True,56 or older,Female,TAFE
6,False,True,False,True,False,False,3-4,False,False,False,False,False,20 or younger,Male,TAFE
7,False,False,False,False,False,False,3-4,False,False,False,True,True,46-50,Male,TAFE


In [113]:
combined = pd.concat([dete_res_selected, tafe_res_selected], ignore_index=True)
combined.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 651 entries, 0 to 650
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   career_move_public_sector   643 non-null    object 
 1   career_move_private_sector  643 non-null    object 
 2   interpersonal_conflict      643 non-null    object 
 3   job_dissatisfaction         643 non-null    object 
 4   work_environment            651 non-null    bool   
 5   lack_recognition            651 non-null    bool   
 6   maternity_family            643 non-null    object 
 7   study_travel                651 non-null    bool   
 8   ill_health                  643 non-null    object 
 9   work_life_balance           651 non-null    bool   
 10  gender                      592 non-null    object 
 11  age                         596 non-null    object 
 12  cease_date_year             300 non-null    float64
 13  start_date_year             283 non

In [105]:
pd.pivot_table(combined, "dissatisfied", "duration_employment", aggfunc=[fraction_pos, len], margins=True).sort_values(("fraction_pos", "dissatisfied"))

duration_employment,fraction_pos (dissatisfied),len (dissatisfied)
1-2,0.38,100
3-4,0.434,99
Less than 1 year,0.473,93
All,0.494,563
11-20,0.506,83
5-6,0.534,73
More than 20 years,0.623,53
7-10,0.629,62


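After combining the datasets from the two surveys, we see that:

* More employees resign early than late: each of the three groups with up to 4 years of service has 93 to 100 respondents, and the counts then drop to 53 for "More than 20 years". The exception is "11-20" (83), but note that the bins differ in width.
* The share of dissatisfied employees is lower among those who resign early: every group with up to 4 years of service is below the overall 49.4%, and every group with 5 or more years is above it (about 63% for "7-10" and "More than 20 years").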
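
The pivot tables here rely on a `fraction_pos` helper whose definition is not shown in this part of the notebook. Below is a minimal sketch of what such a helper might look like, assuming it returns the share of `True` values among the non-missing entries. The toy data and the `groupby`-based summary are illustrative only and are not the notebook's own code.

```python
import pandas as pd

def fraction_pos(values):
    """Share of True entries among non-missing values (assumed behaviour)."""
    values = pd.Series(values).dropna()
    return values.astype(bool).mean()

# Hypothetical toy data with the same column names as `combined`
toy = pd.DataFrame({
    "duration_employment": ["1-2", "1-2", "1-2", "1-2", "7-10", "7-10"],
    "dissatisfied": [True, False, False, False, True, True],
})

# Fraction of dissatisfied employees and group size per duration group
summary = toy.groupby("duration_employment")["dissatisfied"].agg([fraction_pos, "size"])
print(summary)
```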

TODO: Plot histograms to see which factors are most relevant for each duration group.

And what about grouping by age?

In [114]:
pd.pivot_table(combined, "dissatisfied", "age", aggfunc=[fraction_pos, len], margins=True).sort_values(("fraction_pos", "dissatisfied"))

age,fraction_pos (dissatisfied),len (dissatisfied)
21-25,0.323,62
20 or younger,0.4,10
41-45,0.462,93
All,0.487,596
36-40,0.493,73
46-50,0.506,81
51-55,0.507,71
31-35,0.525,61
26-30,0.537,67
56 or older,0.538,78


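Employees up to 25 years old are the least dissatisfied when they resign (32.3% for "21-25" and 40% for "20 or younger", although the latter group has only 10 respondents). "41-45" is also below the overall 48.7%. At the other end, "56 or older" is the most dissatisfied group (53.8%), with "26-30" (53.7%) and "31-35" (52.5%) close behind. Histograms of the individual factors might help to identify which ones matter most.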
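
As a possible starting point for such plots, one could compute, per age group, the share of respondents who ticked each contributing factor. The sketch below uses made-up toy data and only two of the factor columns. The selection in `factor_cols` is an assumption for illustration, and missing answers are skipped rather than counted as `False`.

```python
import pandas as pd

# Hypothetical subset of the factor columns in `combined`
factor_cols = ["job_dissatisfaction", "work_life_balance"]

toy = pd.DataFrame({
    "age": ["21-25", "21-25", "56 or older", "56 or older"],
    "job_dissatisfaction": [False, False, True, None],  # object column with a gap
    "work_life_balance": [False, True, True, True],
})

# Share of True answers per age group, ignoring missing values
shares = toy.groupby("age")[factor_cols].agg(
    lambda s: s.dropna().astype(bool).mean()
)
print(shares)
```

Calling `shares.plot(kind="bar")` on the result would give one group of bars per age group.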

In [117]:
pd.pivot_table(combined, "dissatisfied", ["age", "gender"], aggfunc=[fraction_pos, len], margins=True).sort_values(("fraction_pos", "dissatisfied"))

age,gender,fraction_pos (dissatisfied),len (dissatisfied)
20 or younger,Male,0.0,3
21-25,Male,0.176,17
21-25,Female,0.378,45
41-45,Female,0.42,69
51-55,Female,0.444,45
36-40,Female,0.46,50
All,,0.488,590
46-50,Male,0.5,18
56 or older,Female,0.511,47
46-50,Female,0.516,62


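There are some interesting differences between the genders. In the rows shown, for example, only 17.6% of the 17 men aged "21-25" resigned due to dissatisfaction, compared with 37.8% of the 45 women in that group, whereas for "46-50" the shares are almost equal (50.0% of 18 men vs. 51.6% of 62 women). Several of these subgroups are small, so the differences should be read with caution.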
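
To compare the genders side by side, the shares can be reshaped so that each age group has one column per gender. This is a small sketch on hypothetical toy data. The `gap` column is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same column names as `combined`
toy = pd.DataFrame({
    "age": ["21-25"] * 4 + ["46-50"] * 4,
    "gender": ["Male", "Male", "Female", "Female"] * 2,
    "dissatisfied": [False, False, True, False, True, False, True, False],
})

# One row per age group, one column per gender
by_gender = toy.groupby(["age", "gender"])["dissatisfied"].mean().unstack("gender")

# Difference in the share of dissatisfied employees (female minus male)
by_gender["gap"] = by_gender["Female"] - by_gender["Male"]
print(by_gender)
```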

In [119]:
combined.isnull().sum()

career_move_public_sector       8
career_move_private_sector      8
interpersonal_conflict          8
job_dissatisfaction             8
work_environment                0
lack_recognition                0
maternity_family                8
study_travel                    0
ill_health                      8
work_life_balance               0
gender                         59
age                            55
cease_date_year               351
start_date_year               368
years_employed                378
duration_employment            88
dissatisfied                    0
institute                       0
dtype: int64

TODO: Identify and deal with missing values.
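
One possible first step is to split the missing-value counts by institute. Columns such as `cease_date_year` were only selected from the DETE frame, so they are necessarily empty for every TAFE row after `pd.concat`. A minimal sketch on made-up toy data:

```python
import pandas as pd

# Hypothetical toy frame mimicking the structure of `combined`
toy = pd.DataFrame({
    "institute": ["DETE", "DETE", "TAFE", "TAFE"],
    "age": ["41-45", None, "21-25", None],
    "cease_date_year": [2012.0, 2013.0, None, None],
})

# Number of missing values per column, separately for each institute
missing_by_institute = (
    toy.drop(columns="institute")
       .isnull()
       .groupby(toy["institute"])
       .sum()
)
print(missing_by_institute)
```

This separates gaps caused by the concatenation from answers that respondents actually left blank.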