# DETE Employee Exit Survey Access Database
In this project, we'll play the role of data analyst and pretend our stakeholders want to know the following:
* Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
* Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

dete_survey = pd.read_csv('dete-exit-survey-january-2014.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')


In [3]:
display(dete_survey.head(10))

Unnamed: 0,ID,SeparationType,Cease Date,DETE Start Date,Role Start Date,Position,Classification,Region,Business Unit,Employment Status,...,Kept informed,Wellness programs,Health & Safety,Gender,Age,Aboriginal,Torres Strait,South Sea,Disability,NESB
0,1,Ill Health Retirement,08/2012,1984,2004,Public Servant,A01-A04,Central Office,Corporate Strategy and Peformance,Permanent Full-time,...,N,N,N,Male,56-60,,,,,Yes
1,2,Voluntary Early Retirement (VER),08/2012,Not Stated,Not Stated,Public Servant,AO5-AO7,Central Office,Corporate Strategy and Peformance,Permanent Full-time,...,N,N,N,Male,56-60,,,,,
2,3,Voluntary Early Retirement (VER),05/2012,2011,2011,Schools Officer,,Central Office,Education Queensland,Permanent Full-time,...,N,N,N,Male,61 or older,,,,,
3,4,Resignation-Other reasons,05/2012,2005,2006,Teacher,Primary,Central Queensland,,Permanent Full-time,...,A,N,A,Female,36-40,,,,,
4,5,Age Retirement,05/2012,1970,1989,Head of Curriculum/Head of Special Education,,South East,,Permanent Full-time,...,N,A,M,Female,61 or older,,,,,
5,6,Resignation-Other reasons,05/2012,1994,1997,Guidance Officer,,Central Office,Education Queensland,Permanent Full-time,...,D,D,,Female,41-45,,,,,
6,7,Age Retirement,05/2012,1972,2007,Teacher,Secondary,Darling Downs South West,,Permanent Part-time,...,D,D,SD,Female,56-60,,,,,
7,8,Age Retirement,05/2012,1988,1990,Teacher Aide,,North Coast,,Permanent Part-time,...,SA,,SA,Female,61 or older,,,,,
8,9,Resignation-Other reasons,07/2012,2009,2009,Teacher,Secondary,North Queensland,,Permanent Full-time,...,A,D,N,Female,31-35,,,,,
9,10,Resignation-Other employer,2012,1997,2008,Teacher Aide,,Not Stated,,Permanent Part-time,...,SD,SD,SD,Female,46-50,,,,,


In [4]:
dete_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
ID                                     822 non-null int64
SeparationType                         822 non-null object
Cease Date                             822 non-null object
DETE Start Date                        822 non-null object
Role Start Date                        822 non-null object
Position                               817 non-null object
Classification                         455 non-null object
Region                                 822 non-null object
Business Unit                          126 non-null object
Employment Status                      817 non-null object
Career move to public sector           822 non-null bool
Career move to private sector          822 non-null bool
Interpersonal conflicts                822 non-null bool
Job dissatisfaction                    822 non-null bool
Dissatisfaction with the department    822 non-null bool
Physical work environ

In [5]:
dete_survey.isnull()

Unnamed: 0,ID,SeparationType,Cease Date,DETE Start Date,Role Start Date,Position,Classification,Region,Business Unit,Employment Status,...,Kept informed,Wellness programs,Health & Safety,Gender,Age,Aboriginal,Torres Strait,South Sea,Disability,NESB
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,True,True,True,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,True,True,True,True
2,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,True,True,True,True,True
3,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,True,True,True,True,True
4,False,False,False,False,False,False,True,False,True,False,...,False,False,False,False,False,True,True,True,True,True
5,False,False,False,False,False,False,True,False,False,False,...,False,False,True,False,False,True,True,True,True,True
6,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,True,True,True,True,True
7,False,False,False,False,False,False,True,False,True,False,...,False,True,False,False,False,True,True,True,True,True
8,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,True,True,True,True,True
9,False,False,False,False,False,False,True,False,True,False,...,False,False,False,False,False,True,True,True,True,True


In [6]:
dete_survey['SeparationType'].value_counts()

Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: SeparationType, dtype: int64

In [7]:
dete_survey=pd.read_csv("dete-exit-survey-january-2014.csv", na_values='Not Stated')
dete_survey_updated=dete_survey.drop(dete_survey.columns[28:49],axis=1)

From the exploration above, we can make the following observations:
* The dete_survey dataframe contains 'Not Stated' values that indicate missing data, but they aren't represented as NaN.
* The dete_survey contains many columns that we don't need to complete our analysis.
* The dete_survey and tafe_survey dataframes contain many of the same columns, but under different column names.
* There are multiple columns/answers that indicate an employee resigned because they were dissatisfied.
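
The first observation can be illustrated on a tiny in-memory CSV. This is a minimal sketch, assuming a made-up three-row file (the `csv_text` contents are hypothetical, not taken from the survey); it only shows how the `na_values` parameter of `pd.read_csv` turns 'Not Stated' into NaN.

```python
import io

import pandas as pd

# Hypothetical three-row CSV that mimics the 'Not Stated' placeholder
csv_text = "ID,DETE Start Date\n1,1984\n2,Not Stated\n3,2011\n"

# Without na_values, 'Not Stated' is read as an ordinary string
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, the placeholder becomes NaN and the column is numeric
clean = pd.read_csv(io.StringIO(csv_text), na_values='Not Stated')

print(raw['DETE Start Date'].isnull().sum())
print(clean['DETE Start Date'].isnull().sum())
print(clean['DETE Start Date'].dtype)
```

Because the placeholder is replaced at read time, the year columns come back as floats, which matches the `1984.0`-style values seen after re-reading the file.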

In [8]:
dete_survey_updated.rename({'SeparationType': 'separation_type'},axis=1,inplace=True)

In [9]:
dete_survey_updated.head()

Unnamed: 0,ID,separation_type,Cease Date,DETE Start Date,Role Start Date,Position,Classification,Region,Business Unit,Employment Status,...,Work life balance,Workload,None of the above,Gender,Age,Aboriginal,Torres Strait,South Sea,Disability,NESB
0,1,Ill Health Retirement,08/2012,1984.0,2004.0,Public Servant,A01-A04,Central Office,Corporate Strategy and Peformance,Permanent Full-time,...,False,False,True,Male,56-60,,,,,Yes
1,2,Voluntary Early Retirement (VER),08/2012,,,Public Servant,AO5-AO7,Central Office,Corporate Strategy and Peformance,Permanent Full-time,...,False,False,False,Male,56-60,,,,,
2,3,Voluntary Early Retirement (VER),05/2012,2011.0,2011.0,Schools Officer,,Central Office,Education Queensland,Permanent Full-time,...,False,False,True,Male,61 or older,,,,,
3,4,Resignation-Other reasons,05/2012,2005.0,2006.0,Teacher,Primary,Central Queensland,,Permanent Full-time,...,False,False,False,Female,36-40,,,,,
4,5,Age Retirement,05/2012,1970.0,1989.0,Head of Curriculum/Head of Special Education,,South East,,Permanent Full-time,...,True,False,False,Female,61 or older,,,,,


In [10]:
dete_survey_updated.columns=dete_survey_updated.columns.str.strip().str.replace(' ','_').str.lower()

In [11]:
dete_survey_updated.columns

Index(['id', 'separation_type', 'cease_date', 'dete_start_date',
       'role_start_date', 'position', 'classification', 'region',
       'business_unit', 'employment_status', 'career_move_to_public_sector',
       'career_move_to_private_sector', 'interpersonal_conflicts',
       'job_dissatisfaction', 'dissatisfaction_with_the_department',
       'physical_work_environment', 'lack_of_recognition',
       'lack_of_job_security', 'work_location', 'employment_conditions',
       'maternity/family', 'relocation', 'study/travel', 'ill_health',
       'traumatic_incident', 'work_life_balance', 'workload',
       'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait',
       'south_sea', 'disability', 'nesb'],
      dtype='object')

In [12]:
dete_survey_updated.duplicated(['id']).value_counts()

False    822
dtype: int64

In [13]:
dete_survey_updated['separation_type'].value_counts(dropna=False)

Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separation_type, dtype: int64

Resignations of all types account for around 38% of separations (311 of 822), followed by Age Retirement at around 35%. As we will see later, resignations make up just under 50% of tafe_survey. These high proportions of resignations call for an explanation, which we will look for after preparing our data.
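
As a quick arithmetic check, the shares can be recomputed from the value counts printed above. This is a minimal sketch that hard-codes those counts; the variable names are illustrative only.

```python
# Counts copied from the separation_type value counts printed above
counts = {
    'Age Retirement': 285,
    'Resignation-Other reasons': 150,
    'Resignation-Other employer': 91,
    'Resignation-Move overseas/interstate': 70,
    'Voluntary Early Retirement (VER)': 67,
    'Ill Health Retirement': 61,
    'Other': 49,
    'Contract Expired': 34,
    'Termination': 15,
}

total = sum(counts.values())

# Add up every category whose label starts with 'Resignation'
resignations = sum(n for label, n in counts.items()
                   if label.startswith('Resignation'))

resignation_share = 100 * resignations / total
age_retirement_share = 100 * counts['Age Retirement'] / total

print(total, resignations)
print(round(resignation_share, 1), round(age_retirement_share, 1))
```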

In [14]:
dete_survey_updated['separation_type'].value_counts()

Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separation_type, dtype: int64

In [15]:
dete_survey_updated['separationtype']=dete_survey_updated['separation_type'].str.split('-').str[0]
dete_resignation=dete_survey_updated.copy()[dete_survey_updated['separationtype'].str.contains(r'Resignation')]

In [16]:
display(dete_resignation)

Unnamed: 0,id,separation_type,cease_date,dete_start_date,role_start_date,position,classification,region,business_unit,employment_status,...,workload,none_of_the_above,gender,age,aboriginal,torres_strait,south_sea,disability,nesb,separationtype
3,4,Resignation-Other reasons,05/2012,2005.0,2006.0,Teacher,Primary,Central Queensland,,Permanent Full-time,...,False,False,Female,36-40,,,,,,Resignation
5,6,Resignation-Other reasons,05/2012,1994.0,1997.0,Guidance Officer,,Central Office,Education Queensland,Permanent Full-time,...,False,False,Female,41-45,,,,,,Resignation
8,9,Resignation-Other reasons,07/2012,2009.0,2009.0,Teacher,Secondary,North Queensland,,Permanent Full-time,...,False,False,Female,31-35,,,,,,Resignation
9,10,Resignation-Other employer,2012,1997.0,2008.0,Teacher Aide,,,,Permanent Part-time,...,False,False,Female,46-50,,,,,,Resignation
11,12,Resignation-Move overseas/interstate,2012,2009.0,2009.0,Teacher,Secondary,Far North Queensland,,Permanent Full-time,...,False,False,Male,31-35,,,,,,Resignation
12,13,Resignation-Other reasons,2012,1998.0,1998.0,Teacher,Primary,Far North Queensland,,Permanent Full-time,...,False,False,Female,36-40,,,,,,Resignation
14,15,Resignation-Other employer,2012,2007.0,2010.0,Teacher,Secondary,Central Queensland,,Permanent Full-time,...,False,False,Male,31-35,,,,,,Resignation
16,17,Resignation-Other reasons,2012,,,Teacher Aide,,South East,,Permanent Part-time,...,False,False,Male,61 or older,,,,,,Resignation
20,21,Resignation-Other employer,2012,1982.0,1982.0,Teacher,Secondary,Central Queensland,,Permanent Full-time,...,False,True,Male,56-60,,,,,,Resignation
21,22,Resignation-Other reasons,2012,1980.0,2009.0,Cleaner,,Darling Downs South West,,Permanent Part-time,...,False,False,Female,51-55,,,,,,Resignation


Note that the dete_survey_updated dataframe contains multiple separation types with the string 'Resignation':
* Resignation-Other reasons
* Resignation-Other employer
* Resignation-Move overseas/interstate

## Data Verification / Checking Validity
Now we will check the validity of dates and years in our dataset. Starting with `cease_date`: the last year of a person's employment should not be later than the current year (the year the data was created). Similarly, for `dete_start_date`, we can assume that people working here won't be much over 60 and that they started working in their 20s, so the earliest plausible value for this column is around 1940.
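
These rules can be written as Boolean checks. The sketch below runs on a hypothetical four-row frame, and the bounds `EARLIEST_START = 1940` and `LATEST_CEASE = 2014` are assumptions chosen to match the reasoning above, not values confirmed by the data provider.

```python
import pandas as pd

# Hypothetical rows: the last one has a start year after its cease year
years = pd.DataFrame({
    'dete_start_date': [1963.0, 2005.0, None, 2015.0],
    'cease_date': [2012.0, 2013.0, 2014.0, 2013.0],
})

EARLIEST_START = 1940  # assumed lower bound for a start year
LATEST_CEASE = 2014    # assumed year the data was created

start = years['dete_start_date']
cease = years['cease_date']

# Missing values are not flagged here; they are handled separately
start_ok = start.isnull() | start.between(EARLIEST_START, LATEST_CEASE)
cease_ok = cease.isnull() | (cease <= LATEST_CEASE)
order_ok = start.isnull() | cease.isnull() | (start <= cease)

print(years[~(start_ok & cease_ok & order_ok)])
```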


In [17]:
dete_survey_updated['cease_date'].value_counts()

2012       344
2013       200
01/2014     43
12/2013     40
09/2013     34
06/2013     27
07/2013     22
10/2013     20
11/2013     16
08/2013     12
05/2013      7
05/2012      6
04/2014      2
04/2013      2
07/2014      2
02/2014      2
08/2012      2
2010         1
2014         1
07/2006      1
07/2012      1
09/2014      1
09/2010      1
11/2012      1
Name: cease_date, dtype: int64

In [18]:
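# Extract the four-digit year from strings such as '05/2012' or '2012';
# expand=False returns a Series rather than a one-column DataFrame
dete_resignation['cease_date'] = dete_resignation['cease_date'].str.extract(r'(?P<Years>[1-2][0-9]{3})', expand=False)
dete_resignation['cease_date'] = dete_resignation['cease_date'].astype(float)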
dete_resignation['cease_date'].value_counts()

2013.0    146
2012.0    129
2014.0     22
2010.0      2
2006.0      1
Name: cease_date, dtype: int64

In [19]:
dete_resignation['dete_start_date']=dete_resignation['dete_start_date'].astype(float)
dete_resignation['dete_start_date'].value_counts().sort_index(ascending=True)

1963.0     1
1971.0     1
1972.0     1
1973.0     1
1974.0     2
1975.0     1
1976.0     2
1977.0     1
1980.0     5
1982.0     1
1983.0     2
1984.0     1
1985.0     3
1986.0     3
1987.0     1
1988.0     4
1989.0     4
1990.0     5
1991.0     4
1992.0     6
1993.0     5
1994.0     6
1995.0     4
1996.0     6
1997.0     5
1998.0     6
1999.0     8
2000.0     9
2001.0     3
2002.0     6
2003.0     6
2004.0    14
2005.0    15
2006.0    13
2007.0    21
2008.0    22
2009.0    13
2010.0    17
2011.0    24
2012.0    21
2013.0    10
Name: dete_start_date, dtype: int64

In [20]:
dete_resignation['institute_service'] = (dete_resignation['cease_date'] - dete_resignation['dete_start_date'])
dete_resignation['institute_service'].value_counts()

5.0     23
1.0     22
3.0     20
0.0     20
6.0     17
4.0     16
9.0     14
2.0     14
7.0     13
13.0     8
8.0      8
20.0     7
15.0     7
10.0     6
22.0     6
14.0     6
17.0     6
12.0     6
16.0     5
18.0     5
23.0     4
11.0     4
24.0     4
39.0     3
19.0     3
21.0     3
32.0     3
28.0     2
26.0     2
25.0     2
30.0     2
36.0     2
29.0     1
33.0     1
42.0     1
27.0     1
41.0     1
35.0     1
38.0     1
34.0     1
49.0     1
31.0     1
Name: institute_service, dtype: int64

* Excluding the null values of the institute_service field from the DETE dataset, we observe that 42% of the employees worked at most 5 years.
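
The 42% figure can be reproduced from the numbers printed in this notebook. This is a small sketch that hard-codes the counts for 0 to 5 years of service, the 311 resignation rows, and the 38 missing values reported for `institute_service`.

```python
# Counts for 0-5 years of service, copied from the value counts above
short_service_counts = {0: 20, 1: 22, 2: 14, 3: 20, 4: 16, 5: 23}

total_rows = 311     # DETE resignation rows
missing_service = 38  # rows where institute_service is null

non_null = total_rows - missing_service
at_most_five = sum(short_service_counts.values())

share = 100 * at_most_five / non_null
print(at_most_five, non_null, round(share, 1))
```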


In [21]:
dete_resignation['institute_service'].isnull().sum()

38

In [23]:
dete_resignation.loc[:,'job_dissatisfaction':'workload']

Unnamed: 0,job_dissatisfaction,dissatisfaction_with_the_department,physical_work_environment,lack_of_recognition,lack_of_job_security,work_location,employment_conditions,maternity/family,relocation,study/travel,ill_health,traumatic_incident,work_life_balance,workload
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,True,True,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,False,False,False,False
9,True,True,False,False,False,False,False,False,False,False,False,False,False,False
11,False,False,False,False,False,False,False,True,True,False,False,False,False,False
12,False,False,False,False,False,False,False,True,False,False,False,False,False,False
14,True,True,False,False,False,False,False,False,False,False,False,False,False,False
16,False,False,False,True,False,False,False,False,True,False,False,False,False,False
20,False,False,False,False,False,False,False,False,False,False,False,False,False,False
21,False,False,False,False,False,False,False,False,False,False,False,False,False,False


Because these columns hold Boolean values, an employee is classified as 'dissatisfied' if at least one of these columns is True in the corresponding row. Before proceeding, note that the columns from maternity/family to traumatic_incident should be left out of the 'dissatisfied' calculation, as they are not related to the institute.
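
The row-wise rule can be tried on a toy frame first. This is a minimal sketch with three hypothetical rows; `relocation` is included only to show that a True value in a column outside the chosen list does not affect the result.

```python
import pandas as pd

# Hypothetical Boolean answers for three employees
toy = pd.DataFrame({
    'job_dissatisfaction': [False, True, False],
    'workload': [False, False, True],
    'relocation': [True, False, False],  # not an institute-related factor
})

# Only institute-related factors count towards dissatisfaction
factors = ['job_dissatisfaction', 'workload']

# any(axis=1) is True for a row when at least one selected column is True
toy['dissatisfied'] = toy[factors].any(axis=1)

print(toy['dissatisfied'].tolist())
```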



In [25]:
# Create a Boolean 'dissatisfied' column.
# DataFrame.any(axis=1) returns True for a row if any selected column is True.
# The DETE factor columns are already Boolean, so no value mapping is needed here.
dissatisfaction_cols = ['job_dissatisfaction',
                        'dissatisfaction_with_the_department',
                        'physical_work_environment', 'lack_of_recognition',
                        'lack_of_job_security', 'work_location', 'employment_conditions',
                        'work_life_balance', 'workload']
dete_resignation['dissatisfied'] = dete_resignation[dissatisfaction_cols].any(axis=1, skipna=False)
dete_resignation_up = dete_resignation.copy()
dete_resignation_up['dissatisfied'].value_counts(dropna=False)

False    162
True     149
Name: dissatisfied, dtype: int64

In [28]:
dete_resignation_up['dissatisfied'].head(10)

3     False
5      True
8     False
9      True
11    False
12    False
14     True
16     True
20    False
21    False
Name: dissatisfied, dtype: bool

In [30]:
tafe_survey.head()

Unnamed: 0,Record ID,Institute,WorkArea,CESSATION YEAR,Reason for ceasing employment,Contributing Factors. Career Move - Public Sector,Contributing Factors. Career Move - Private Sector,Contributing Factors. Career Move - Self-employment,Contributing Factors. Ill Health,Contributing Factors. Maternity/Family,...,Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?,Workplace. Topic:Does your workplace promote and practice the principles of employment equity?,Workplace. Topic:Does your workplace value the diversity of its employees?,Workplace. Topic:Would you recommend the Institute as an employer to others?,Gender. What is your Gender?,CurrentAge. Current Age,Employment Type. Employment Type,Classification. Classification,LengthofServiceOverall. Overall Length of Service at Institute (in years),LengthofServiceCurrent. Length of Service at current workplace (in years)
0,6.34133e+17,Southern Queensland Institute of TAFE,Non-Delivery (corporate),2010.0,Contract Expired,,,,,,...,Yes,Yes,Yes,Yes,Female,26 30,Temporary Full-time,Administration (AO),1-2,1-2
1,6.341337e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Retirement,-,-,-,-,-,...,Yes,Yes,Yes,Yes,,,,,,
2,6.341388e+17,Mount Isa Institute of TAFE,Delivery (teaching),2010.0,Retirement,-,-,-,-,-,...,Yes,Yes,Yes,Yes,,,,,,
3,6.341399e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,-,-,-,-,...,Yes,Yes,Yes,Yes,,,,,,
4,6.341466e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,Career Move - Private Sector,-,-,-,...,Yes,Yes,Yes,Yes,Male,41 45,Permanent Full-time,Teacher (including LVT),3-4,3-4


In [31]:
tafe_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 72 columns):
Record ID                                                                                                                                                        702 non-null float64
Institute                                                                                                                                                        702 non-null object
WorkArea                                                                                                                                                         702 non-null object
CESSATION YEAR                                                                                                                                                   695 non-null float64
Reason for ceasing employment                                                                                                                                    701 non-

In [33]:
tafe_survey.isnull().sum()

Record ID                                                                                                                                                          0
Institute                                                                                                                                                          0
WorkArea                                                                                                                                                           0
CESSATION YEAR                                                                                                                                                     7
Reason for ceasing employment                                                                                                                                      1
Contributing Factors. Career Move - Public Sector                                                                                                                265
Contributi

* Some columns in the two surveys have different names but hold the same information, so we need to rename them consistently before combining the data and answering questions like:
  * Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
  * Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
* Some cells in dete_survey contain 'Not Stated' rather than NaN. While cleaning data, we have to check each column for uniformity: its values should share a datatype and represent the same kind of quantity. For example, if a column contains the marks of a student, we should make sure that no values represent grades or percentages.
* There are many columns in both surveys that are not needed for our analysis.
* The two surveys do not have the same number of columns, so we will need to keep only the useful ones.

In [68]:
columns = {
    'Record ID': 'id',
    'CESSATION YEAR': 'cease_date',
    'Reason for ceasing employment': 'separation_type',
    'Gender. What is your Gender?': 'gender',
    'CurrentAge. Current Age': 'age',
    'Employment Type. Employment Type': 'employment_status',
    'Classification. Classification': 'position',
    'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
    'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service',
    'Contributing Factors. Dissatisfaction': 'factors_diss',
    'Contributing Factors. Job Dissatisfaction': 'factors_job_diss',
}
tafe_survey = tafe_survey.rename(columns=columns)

In [74]:
# we choose columns to drop on the same basis as we did for dete_survey
tafe_survey = tafe_survey.drop(tafe_survey.columns[17:66], axis = 1)

In [75]:
tafe_survey.head()
tafe_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 17 columns):
id                                                     702 non-null float64
Institute                                              702 non-null object
WorkArea                                               702 non-null object
cease_date                                             695 non-null float64
separation_type                                        701 non-null object
Contributing Factors. Career Move - Public Sector      437 non-null object
Contributing Factors. Career Move - Private Sector     437 non-null object
Contributing Factors. Career Move - Self-employment    437 non-null object
Contributing Factors. Ill Health                       437 non-null object
Contributing Factors. Maternity/Family                 437 non-null object
factors_diss                                           437 non-null object
factors_job_diss                                       437 non-null 

In [76]:
tafe_survey.duplicated(['id']).value_counts()

False    702
dtype: int64

In [49]:
tafe_survey['separation_type'].value_counts(dropna=False)

Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
NaN                           1
Name: separation_type, dtype: int64

In the DETE survey, the separation_type column has three values containing the word 'Resignation', whereas in the TAFE survey there is only one.
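
Since the TAFE column also contains one missing value, a string match on it needs to decide what to do with NaN. The sketch below uses a hypothetical three-element Series to show how `na=False` in `Series.str.contains` keeps the mask strictly Boolean.

```python
import numpy as np
import pandas as pd

# Hypothetical separation types, including one missing value
separation = pd.Series(['Resignation', 'Retirement', np.nan])

# na=False makes the missing entry count as "no match" instead of NaN
mask = separation.str.contains('Resignation', na=False)

print(mask.tolist())
print(separation[mask].tolist())
```

Without `na=False`, the mask would contain a NaN entry, which cannot be used directly for Boolean indexing.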

In [50]:
tafe_resignation=tafe_survey.copy()[tafe_survey['separation_type'].str.contains(r'Resignation',na=False)]
tafe_resignation.head()

Unnamed: 0,id,Institute,WorkArea,cease_date,separation_type,Contributing Factors. Career Move - Public Sector,Contributing Factors. Career Move - Private Sector,Contributing Factors. Career Move - Self-employment,Contributing Factors. Ill Health,Contributing Factors. Maternity/Family,factors_diss,factors_job_diss,Contributing Factors. Interpersonal Conflict,Contributing Factors. Study,Contributing Factors. Travel,Contributing Factors. Other,Contributing Factors. NONE
3,6.341399e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,-,-,-,-,-,-,-,-,Travel,-,-
4,6.341466e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,Career Move - Private Sector,-,-,-,-,-,-,-,-,-,-
5,6.341475e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,-,-,-,-,-,-,-,-,-,Other,-
6,6.34152e+17,Barrier Reef Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,Career Move - Private Sector,-,-,Maternity/Family,-,-,-,-,-,Other,-
7,6.341537e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,-,-,-,-,-,-,-,-,-,Other,-


In [52]:
tafe_resignation['cease_date'].value_counts()

2011.0    116
2012.0     94
2010.0     68
2013.0     55
2009.0      2
Name: cease_date, dtype: int64

In [77]:
#tafe_resignation['institute_service'].value_counts(dropna=False)

In [78]:
def update_vals(x):
    if x =='-':
        return False
    elif pd.isnull(x):
        return np.nan
    else:
        return True
# Map '-' to False, missing values to NaN and anything else to True, then combine row-wise
tafe_resignation['dissatisfied'] = tafe_resignation[['factors_diss', 'factors_job_diss']].applymap(update_vals).any(axis=1, skipna=False)
tafe_resignation['dissatisfied'].head()
tafe_resignation_up = tafe_resignation.copy()
tafe_resignation_up['dissatisfied'].value_counts(dropna=False)

False    241
True      91
NaN        8
Name: dissatisfied, dtype: int64
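
To make the mapping step easier to follow, here is what `update_vals` does to individual values. This is a sketch on a hypothetical three-element Series; it repeats the function definition so that it runs on its own and uses `Series.map` rather than the frame-wide call from the cell above.

```python
import numpy as np
import pandas as pd


def update_vals(x):
    """Map '-' to False, missing values to NaN, anything else to True."""
    if x == '-':
        return False
    elif pd.isnull(x):
        return np.nan
    else:
        return True


# Hypothetical answers: not selected, selected, and missing
answers = pd.Series(['-', 'Job Dissatisfaction', np.nan])
mapped = answers.map(update_vals)

print(mapped.tolist())
```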

In [79]:
dete_resignation_up['institute']='DETE'
tafe_resignation_up['institute']='TAFE'

In [80]:
combined=pd.concat([dete_resignation_up,tafe_resignation_up], ignore_index=True)
combined.shape

(651, 53)

In [81]:
combined_null = combined.isnull().sum()
combined_null

Contributing Factors. Career Move - Private Sector     319
Contributing Factors. Career Move - Public Sector      319
Contributing Factors. Career Move - Self-employment    319
Contributing Factors. Ill Health                       319
Contributing Factors. Interpersonal Conflict           319
Contributing Factors. Maternity/Family                 319
Contributing Factors. NONE                             319
Contributing Factors. Other                            319
Contributing Factors. Study                            319
Contributing Factors. Travel                           319
Institute                                              311
WorkArea                                               311
aboriginal                                             644
age                                                    345
business_unit                                          619
career_move_to_private_sector                          340
career_move_to_public_sector                           3

In [82]:
columns_drops = combined_null[combined_null >= 400]

In [83]:
combined=combined.dropna(thresh=200,axis=1)
combined

Unnamed: 0,Contributing Factors. Career Move - Private Sector,Contributing Factors. Career Move - Public Sector,Contributing Factors. Career Move - Self-employment,Contributing Factors. Ill Health,Contributing Factors. Interpersonal Conflict,Contributing Factors. Maternity/Family,Contributing Factors. NONE,Contributing Factors. Other,Contributing Factors. Study,Contributing Factors. Travel,...,region,relocation,role_start_date,separation_type,separationtype,study/travel,traumatic_incident,work_life_balance,work_location,workload
0,,,,,,,,,,,...,Central Queensland,False,2006.0,Resignation-Other reasons,Resignation,False,False,False,False,False
1,,,,,,,,,,,...,Central Office,False,1997.0,Resignation-Other reasons,Resignation,False,False,False,False,False
2,,,,,,,,,,,...,North Queensland,False,2009.0,Resignation-Other reasons,Resignation,False,False,False,False,False
3,,,,,,,,,,,...,,False,2008.0,Resignation-Other employer,Resignation,False,False,False,False,False
4,,,,,,,,,,,...,Far North Queensland,True,2009.0,Resignation-Move overseas/interstate,Resignation,False,False,False,False,False
5,,,,,,,,,,,...,Far North Queensland,False,1998.0,Resignation-Other reasons,Resignation,False,False,False,False,False
6,,,,,,,,,,,...,Central Queensland,False,2010.0,Resignation-Other employer,Resignation,False,False,False,False,False
7,,,,,,,,,,,...,South East,True,,Resignation-Other reasons,Resignation,False,False,False,False,False
8,,,,,,,,,,,...,Central Queensland,False,1982.0,Resignation-Other employer,Resignation,False,False,False,False,False
9,,,,,,,,,,,...,Darling Downs South West,False,2009.0,Resignation-Other reasons,Resignation,False,False,False,False,False


In [84]:
combined['institute_service']

0       7.0
1      18.0
2       3.0
3      15.0
4       3.0
5      14.0
6       5.0
7       NaN
8      30.0
9      32.0
10     15.0
11     39.0
12     17.0
13      7.0
14      9.0
15      6.0
16      1.0
17      NaN
18     35.0
19     38.0
20      1.0
21     36.0
22      3.0
23      3.0
24     19.0
25      4.0
26      9.0
27      1.0
28      6.0
29      1.0
       ... 
621     NaN
622     NaN
623     NaN
624     NaN
625     NaN
626     NaN
627     NaN
628     NaN
629     NaN
630     NaN
631     NaN
632     NaN
633     NaN
634     NaN
635     NaN
636     NaN
637     NaN
638     NaN
639     NaN
640     NaN
641     NaN
642     NaN
643     NaN
644     NaN
645     NaN
646     NaN
647     NaN
648     NaN
649     NaN
650     NaN
Name: institute_service, Length: 651, dtype: float64

In [85]:
type(combined['institute_service'][2])

numpy.float64

In [86]:
def service_category(val):
    if pd.isna(val):
        return np.nan
    elif val < 3:
        return 'New'
    elif val < 7:
        return 'Experienced'
    elif val < 11:
        return 'Established'
    else:
        return 'Veteran'

In [87]:
combined['institute_service'].value_counts(dropna=False)

NaN      378
 5.0      23
 1.0      22
 3.0      20
 0.0      20
 6.0      17
 4.0      16
 9.0      14
 2.0      14
 7.0      13
 13.0      8
 8.0       8
 20.0      7
 15.0      7
 12.0      6
 22.0      6
 17.0      6
 10.0      6
 14.0      6
 16.0      5
 18.0      5
 24.0      4
 23.0      4
 11.0      4
 39.0      3
 32.0      3
 19.0      3
 21.0      3
 36.0      2
 30.0      2
 25.0      2
 28.0      2
 26.0      2
 29.0      1
 42.0      1
 38.0      1
 27.0      1
 41.0      1
 35.0      1
 49.0      1
 34.0      1
 33.0      1
 31.0      1
Name: institute_service, dtype: int64
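
A natural next step is to apply `service_category` to the years of service. The sketch below does this on a hypothetical five-element Series rather than on `combined`, and repeats the function so that it runs on its own; the name `service_cat` is illustrative.

```python
import numpy as np
import pandas as pd


def service_category(val):
    """Bucket years of service into four career stages."""
    if pd.isna(val):
        return np.nan
    elif val < 3:
        return 'New'
    elif val < 7:
        return 'Experienced'
    elif val < 11:
        return 'Established'
    else:
        return 'Veteran'


# Hypothetical years of service, one per bucket plus a missing value
service = pd.Series([0.0, 3.0, 7.0, 11.0, np.nan])
service_cat = service.apply(service_category)

print(service_cat.value_counts(dropna=False))
```

The boundaries are half-open, so exactly 3 years falls under 'Experienced' and exactly 11 years under 'Veteran'.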