# Project 6 - Clean and Analyze Employee Exit Surveys

## 1. Introduction:

This project aims to clean and analyze exit surveys to learn how different factors affect employee resignations.

To reach this goal, we will try to answer the following questions:
* Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
* Are younger employees resigning due to some kind of dissatisfaction? What about older employees?




We will work with two modified datasets of exit surveys completed by employees of:
* the Department of Education, Training and Employment ([DETE](https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey))
* and the Technical and Further Education ([TAFE](https://en.wikipedia.org/wiki/TAFE_Queensland)) institute in Queensland, Australia. 

## 2. Initial overview of the datasets:

### 2.1 Importing datasets

In [5]:
import numpy as np
import pandas as pd

In [8]:
dete_survey = pd.read_csv("dete_survey.csv")

In [7]:
tafe_survey = pd.read_csv("tafe_survey.csv")

### 2.2 Checking first and last 5 rows of each dataset: 

#### 2.2.1 dete_survey:

In [11]:
dete_survey

Unnamed: 0,ID,SeparationType,Cease Date,DETE Start Date,Role Start Date,Position,Classification,Region,Business Unit,Employment Status,...,Kept informed,Wellness programs,Health & Safety,Gender,Age,Aboriginal,Torres Strait,South Sea,Disability,NESB
0,1,Ill Health Retirement,08/2012,1984,2004,Public Servant,A01-A04,Central Office,Corporate Strategy and Peformance,Permanent Full-time,...,N,N,N,Male,56-60,,,,,Yes
1,2,Voluntary Early Retirement (VER),08/2012,Not Stated,Not Stated,Public Servant,AO5-AO7,Central Office,Corporate Strategy and Peformance,Permanent Full-time,...,N,N,N,Male,56-60,,,,,
2,3,Voluntary Early Retirement (VER),05/2012,2011,2011,Schools Officer,,Central Office,Education Queensland,Permanent Full-time,...,N,N,N,Male,61 or older,,,,,
3,4,Resignation-Other reasons,05/2012,2005,2006,Teacher,Primary,Central Queensland,,Permanent Full-time,...,A,N,A,Female,36-40,,,,,
4,5,Age Retirement,05/2012,1970,1989,Head of Curriculum/Head of Special Education,,South East,,Permanent Full-time,...,N,A,M,Female,61 or older,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
817,819,Age Retirement,02/2014,1977,1999,Teacher,Primary,Central Queensland,,Permanent Part-time,...,A,A,SA,Female,56-60,,,,,
818,820,Age Retirement,01/2014,1980,1980,Teacher,Secondary,North Coast,,Permanent Full-time,...,N,N,N,Male,51-55,,,,,
819,821,Resignation-Move overseas/interstate,01/2014,2009,2009,Public Servant,A01-A04,Central Office,Education Queensland,Permanent Full-time,...,A,N,A,Female,31-35,,,,,
820,822,Ill Health Retirement,12/2013,2001,2009,Teacher,Secondary,Darling Downs South West,,Permanent Full-time,...,A,N,A,Female,41-45,,,,,


In [12]:
dete_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   ID                                   822 non-null    int64 
 1   SeparationType                       822 non-null    object
 2   Cease Date                           822 non-null    object
 3   DETE Start Date                      822 non-null    object
 4   Role Start Date                      822 non-null    object
 5   Position                             817 non-null    object
 6   Classification                       455 non-null    object
 7   Region                               822 non-null    object
 8   Business Unit                        126 non-null    object
 9   Employment Status                    817 non-null    object
 10  Career move to public sector         822 non-null    bool  
 11  Career move to private sector        822 non-

In [17]:
dete_survey.isnull().sum()

ID                                       0
SeparationType                           0
Cease Date                               0
DETE Start Date                          0
Role Start Date                          0
Position                                 5
Classification                         367
Region                                   0
Business Unit                          696
Employment Status                        5
Career move to public sector             0
Career move to private sector            0
Interpersonal conflicts                  0
Job dissatisfaction                      0
Dissatisfaction with the department      0
Physical work environment                0
Lack of recognition                      0
Lack of job security                     0
Work location                            0
Employment conditions                    0
Maternity/family                         0
Relocation                               0
Study/Travel                             0
Ill Health 

#### 2.2.2 tafe_survey:

In [14]:
tafe_survey

Unnamed: 0,Record ID,Institute,WorkArea,CESSATION YEAR,Reason for ceasing employment,Contributing Factors. Career Move - Public Sector,Contributing Factors. Career Move - Private Sector,Contributing Factors. Career Move - Self-employment,Contributing Factors. Ill Health,Contributing Factors. Maternity/Family,...,Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?,Workplace. Topic:Does your workplace promote and practice the principles of employment equity?,Workplace. Topic:Does your workplace value the diversity of its employees?,Workplace. Topic:Would you recommend the Institute as an employer to others?,Gender. What is your Gender?,CurrentAge. Current Age,Employment Type. Employment Type,Classification. Classification,LengthofServiceOverall. Overall Length of Service at Institute (in years),LengthofServiceCurrent. Length of Service at current workplace (in years)
0,6.341330e+17,Southern Queensland Institute of TAFE,Non-Delivery (corporate),2010.0,Contract Expired,,,,,,...,Yes,Yes,Yes,Yes,Female,26 30,Temporary Full-time,Administration (AO),1-2,1-2
1,6.341337e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Retirement,-,-,-,-,-,...,Yes,Yes,Yes,Yes,,,,,,
2,6.341388e+17,Mount Isa Institute of TAFE,Delivery (teaching),2010.0,Retirement,-,-,-,-,-,...,Yes,Yes,Yes,Yes,,,,,,
3,6.341399e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,-,-,-,-,...,Yes,Yes,Yes,Yes,,,,,,
4,6.341466e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,Career Move - Private Sector,-,-,-,...,Yes,Yes,Yes,Yes,Male,41 45,Permanent Full-time,Teacher (including LVT),3-4,3-4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
697,6.350668e+17,Barrier Reef Institute of TAFE,Delivery (teaching),2013.0,Resignation,Career Move - Public Sector,-,-,-,-,...,Yes,Yes,Yes,Yes,Male,51-55,Temporary Full-time,Teacher (including LVT),1-2,1-2
698,6.350677e+17,Southern Queensland Institute of TAFE,Non-Delivery (corporate),2013.0,Resignation,Career Move - Public Sector,-,-,-,-,...,,,,,,,,,,
699,6.350704e+17,Tropical North Institute of TAFE,Delivery (teaching),2013.0,Resignation,-,-,-,-,-,...,Yes,Yes,Yes,Yes,Female,51-55,Permanent Full-time,Teacher (including LVT),5-6,1-2
700,6.350712e+17,Southbank Institute of Technology,Non-Delivery (corporate),2013.0,Contract Expired,,,,,,...,No,No,Yes,No,Female,41 45,Temporary Full-time,Professional Officer (PO),1-2,1-2


In [16]:
tafe_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 72 columns):
 #   Column                                                                                                                                                         Non-Null Count  Dtype  
---  ------                                                                                                                                                         --------------  -----  
 0   Record ID                                                                                                                                                      702 non-null    float64
 1   Institute                                                                                                                                                      702 non-null    object 
 2   WorkArea                                                                                                                                  

In [22]:
tafe_survey.isnull().sum().sort_values()

Record ID                                                          0
Institute                                                          0
WorkArea                                                           0
Reason for ceasing employment                                      1
CESSATION YEAR                                                     7
                                                                ... 
Contributing Factors. Career Move - Public Sector                265
Contributing Factors. Maternity/Family                           265
Contributing Factors. Ill Health                                 265
InductionInfo. Topic:Did you undertake a Corporate Induction?    270
Main Factor. Which of these was the main factor for leaving?     589
Length: 72, dtype: int64

### 2.3. First notes about datasets:

#### Notes on dete_survey:
 - dataset dimension: 822 rows × 56 columns
 - columns with many missing values, which we must explore to understand why the values are missing and whether anything can be done about them:

        Business Unit                          696
        Aboriginal                             806
        Torres Strait                          819
        South Sea                              815
        Disability                             799
        NESB                                   790

- The following column also has many missing values, although fewer than the ones above:
        
        Classification                         367
      
- The date columns `Cease Date` and `DETE Start Date` contain the value "Not Stated". This value must be replaced with NaN so that it is counted as a null field.
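
One way to handle the "Not Stated" entries is to mark them as missing while the file is loaded. The sketch below is only an illustration under stated assumptions: it uses the `na_values` parameter of `pd.read_csv`, the two sample rows are copied from the `dete_survey` preview above, and the names `sample_csv` and `sample` are hypothetical stand-ins for the real file and DataFrame.

```python
import io
import pandas as pd

# Small stand-in for dete_survey.csv: two rows taken from the preview above.
sample_csv = io.StringIO(
    "ID,Cease Date,DETE Start Date,Role Start Date\n"
    "1,08/2012,1984,2004\n"
    "2,08/2012,Not Stated,Not Stated\n"
)

# na_values tells read_csv to treat the literal string "Not Stated" as NaN.
sample = pd.read_csv(sample_csv, na_values=["Not Stated"])

# The start-date columns now report one missing value each.
print(sample.isnull().sum())
```

With the real file, the equivalent call would presumably be `pd.read_csv("dete_survey.csv", na_values=["Not Stated"])`, after which `isnull().sum()` should also count these entries.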


#### Notes on tafe_survey:
- dataset dimension: 702 rows × 72 columns
- the columns with the most missing values are the following:

        Contributing Factors. Career Move - Public Sector                265
        Contributing Factors. Maternity/Family                           265
        Contributing Factors. Ill Health                                 265
        InductionInfo. Topic:Did you undertake a Corporate Induction?    270
        Main Factor. Which of these was the main factor for leaving?     589
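
A ranking like the one above can be produced by sorting the per-column null counts in descending order. The following is a minimal sketch on a hypothetical miniature frame (`demo` and its values are invented for illustration); with the real data, `tafe_survey` would take the place of `demo`.

```python
import pandas as pd

# Hypothetical miniature frame; the real call would use tafe_survey instead.
demo = pd.DataFrame({
    "Record ID": [1, 2, 3, 4],
    "Institute": ["A", "B", None, "D"],
    "Main Factor": [None, None, None, "Other"],
})

# Number of missing values per column, largest first.
missing_count = demo.isnull().sum().sort_values(ascending=False)

# Share of rows that are missing in each column (mean of the boolean mask).
missing_share = demo.isnull().mean().sort_values(ascending=False)

print(missing_count.head())
print(missing_share.head())
```

Looking at the share as well as the count makes it easier to compare the two surveys, since they have different numbers of rows.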
        
        
    

#### Notes common to both datasets:
- The columns holding date values are not of datetime type. Their type must be changed, because we will need this information to work out how long employees stayed in their job before resigning.
- Column names must be simplified and made identical in both datasets, especially for the columns we already know we will use: age, gender, dates, the reason the person left the job, ...
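
Both notes can be sketched on a small example. The code below is an assumption about how the cleanup might look, not the method used later in the project: it normalises the column names with string methods and, instead of a full datetime conversion, extracts the four-digit year from `Cease Date` to estimate the length of service. The two rows are copied from the `dete_survey` preview; `demo`, `cease_year`, and `years_of_service` are hypothetical names.

```python
import pandas as pd

# Two rows taken from the dete_survey preview (start dates as numbers).
demo = pd.DataFrame({
    "Cease Date": ["08/2012", "05/2012"],
    "DETE Start Date": [1984.0, 2011.0],
})

# Simplify column names: lower case, trimmed, spaces replaced by underscores.
demo.columns = demo.columns.str.lower().str.strip().str.replace(" ", "_")

# Pull the four-digit year out of values such as "08/2012".
cease_year = demo["cease_date"].str.extract(r"(\d{4})", expand=False).astype(float)

# Years between starting at DETE and leaving.
demo["years_of_service"] = cease_year - demo["dete_start_date"]

print(demo)
```

Working with years only sidesteps the fact that `Cease Date` mixes month and year while `DETE Start Date` holds a year alone; a conversion with `pd.to_datetime` would be an alternative if month-level precision were needed.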


### 2.4. Data dictionary:


Based on the information collected, the main columns are defined as follows:
* **dete_survey:**
    - `ID`: An id used to identify the participant of the survey
    - `SeparationType`: The reason why the person's employment ended
    - `Cease Date`: The year or month the person's employment ended
    - `DETE Start Date`: The year the person began employment with the DETE


* **tafe_survey:**
    - `Record ID`: An id used to identify the participant of the survey
    - `Reason for ceasing employment`: The reason why the person's employment ended
    - `LengthofServiceOverall. Overall Length of Service at Institute (in years)`: The length of the person's employment (in years)
