# HR Data Investigation

This project analyzes employee satisfaction, performance, and resignation rates using HR data stored in three tables with the following schemas:


employees
| Column Name       | Data Type            |
|------------------|----------------------|
| emp_id          | integer              |
| first_name      | character varying    |
| last_name       | character varying    |
| start_date      | character varying    |
| exit_date       | character varying    |
| title          | character varying    |
| supervisor     | character varying    |
| email          | character varying    |
| business_unit  | character varying    |
| employee_type  | character varying    |
| payzone        | character varying    |
| termination_type | character varying  |
| department     | character varying    |
| division       | character varying    |
| dob           | character varying    |
| state         | character varying    |
| sex           | character varying    |
| location_code | integer              |
| race          | character varying    |
| marital_status | character varying   |
| performance_score | character varying |
| employee_rating  | integer            |

surveys
| Column Name                  | Data Type          |
|------------------------------|--------------------|
| emp_id                       | integer           |
| survey_date                  | character varying |
| engagement_score             | integer           |
| satisfaction_score           | integer           |
| work_life_balance_score      | integer           |

training
| Column Name     | Data Type          |
|----------------|--------------------|
| emp_id         | integer           |
| training_date  | character varying |
| program_name   | character varying |
| training_type  | character varying |
| outcome       | character varying |
| location      | character varying |
| trainer       | character varying |
| duration      | integer           |
| cost          | decimal           |

In [1]:
-- Query to connect workspace to my database. 
SELECT *
FROM auth.audit_log_entries
LIMIT 5;

| instance_id | id | payload | created_at | ip_address |
|-------------|----|---------|------------|------------|


## Where are we seeing the highest resignation rates?

Let's first take a broad look at our dataset to see what kinds of terminations/resignations we're working with.

In [2]:
SELECT 
  termination_type,
  COUNT(emp_id)
FROM employees
WHERE LENGTH(exit_date) != 0
GROUP BY termination_type; 

| termination_type | count |
|------------------|-------|
| Retirement       | 377   |
| Involuntary      | 388   |
| Voluntary        | 388   |
| Resignation      | 380   |


The exits are spread almost evenly across four termination types. Since we are interested in why people decide to leave the company prematurely, let's focus the following queries on the "Voluntary" and "Resignation" types.

In [3]:
SELECT 
  division,
  COUNT(*) AS total_resignations,
  ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM employees e2 WHERE e.division = e2.division), 2) AS resignation_perc
FROM employees e
WHERE termination_type IN('Resignation', 'Voluntary')
GROUP BY division
ORDER BY total_resignations DESC
LIMIT 10;

| division                 | total_resignations | resignation_perc |
|--------------------------|--------------------|------------------|
| Field Operations         | 192                | 24.33            |
| General - Con            | 138                | 27.11            |
| Engineers                | 69                 | 25.09            |
| Wireline Construction    | 52                 | 28.89            |
| Aerial                   | 50                 | 25.51            |
| Project Management - Con | 41                 | 23.03            |
| General - Eng            | 25                 | 29.07            |
| Fielders                 | 25                 | 30.49            |
| General - Sga            | 23                 | 19.66            |
| Splicing                 | 21                 | 19.09            |


This query counts the resignations within each division and computes the percentage of each division's employees who quit or resigned. While Field Operations and General - Con have the highest absolute numbers of resignations (192 and 138), their resignation percentages (24.33% and 27.11%) sit within the 19% to 30% range of the other divisions shown. Let's see if we can find more significant disparities by grouping on a different dimension.
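A side note on the query above: the percentage column uses a correlated subquery that recounts the division for every group. One possible alternative is conditional aggregation, which computes the numerator and the denominator in a single pass. The sketch below runs it from Python with the standard-library `sqlite3` module against a tiny made-up `employees` table. The rows are hypothetical and not taken from the HR dataset, and SQLite stands in for the actual database.

```python
import sqlite3

# In-memory database with a hypothetical, minimal employees table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (emp_id INTEGER, division TEXT, termination_type TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (1, "Aerial", "Resignation"),
        (2, "Aerial", "Voluntary"),
        (3, "Aerial", "Unk"),
        (4, "Aerial", "Retirement"),
        (5, "Splicing", "Unk"),
        (6, "Splicing", "Voluntary"),
        (7, "Splicing", "Unk"),
        (8, "Splicing", "Involuntary"),
        (9, "Splicing", "Unk"),
    ],
)

# Conditional aggregation: count quits and headcount in the same GROUP BY.
query = """
SELECT
  division,
  SUM(CASE WHEN termination_type IN ('Resignation', 'Voluntary') THEN 1 ELSE 0 END)
    AS total_resignations,
  COUNT(*) AS headcount,
  ROUND(100.0 * SUM(CASE WHEN termination_type IN ('Resignation', 'Voluntary')
                         THEN 1 ELSE 0 END) / COUNT(*), 2) AS resignation_perc
FROM employees
GROUP BY division
ORDER BY resignation_perc DESC;
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

One behavioral difference to keep in mind: because there is no `WHERE` filter on `termination_type`, a division with zero resignations still appears in the result (with a percentage of 0), whereas the original query drops it.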

In [4]:
SELECT 
  business_unit,
  COUNT(*) AS total_resignations,
  ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM employees e2 WHERE e.business_unit = e2.business_unit), 2) AS resignation_perc
FROM employees e
WHERE termination_type IN('Resignation', 'Voluntary')
GROUP BY business_unit
ORDER BY total_resignations DESC
LIMIT 10;

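| business_unit | total_resignations | resignation_perc |
|---------------|--------------------|------------------|
| SVG           | 102                | 33.55            |
| NEL           | 81                 | 26.64            |
| PYZ           | 77                 | 25.75            |
| MSC           | 77                 | 26.01            |
| WBL           | 76                 | 25.85            |
| BPC           | 75                 | 24.75            |
| CCDR          | 75                 | 25.0             |
| EW            | 71                 | 23.51            |
| PL            | 68                 | 22.59            |
| TNS           | 66                 | 22.22            |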


Here we do the same as in the prior query, this time aggregating on business unit. One finding that stands out is that the SVG business unit leads all others in both total resignations (102) and resignation percentage (33.55%, against roughly 22% to 27% for the other units). It might be wise to take a deeper look at this unit to see what might be going on.
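To get a rough sense of whether the SVG gap is larger than chance alone would suggest, one option is a pooled two-proportion z-test. The sketch below is an illustration under stated assumptions, not part of the original analysis. The headcounts are back-calculated from the reported counts and percentages (resignations divided by rate, rounded), and the helper names `implied_headcount` and `two_proportion_z` are made up for this example.

```python
import math

# (business_unit, total_resignations, resignation_perc) as reported in the output above.
reported = [
    ("SVG", 102, 33.55), ("NEL", 81, 26.64), ("PYZ", 77, 25.75),
    ("MSC", 77, 26.01), ("WBL", 76, 25.85), ("BPC", 75, 24.75),
    ("CCDR", 75, 25.0), ("EW", 71, 23.51), ("PL", 68, 22.59),
    ("TNS", 66, 22.22),
]

def implied_headcount(resignations, perc):
    """Approximate unit size implied by a count and its percentage (assumption)."""
    return round(resignations / (perc / 100.0))

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for x1/n1 versus x2/n2."""
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / standard_error

headcounts = {unit: implied_headcount(r, p) for unit, r, p in reported}

svg_resigned = 102
svg_total = headcounts["SVG"]
other_resigned = sum(r for unit, r, _ in reported if unit != "SVG")
other_total = sum(n for unit, n in headcounts.items() if unit != "SVG")

z = two_proportion_z(svg_resigned, svg_total, other_resigned, other_total)
print(f"SVG {svg_resigned}/{svg_total} vs others {other_resigned}/{other_total}: z = {z:.2f}")
```

A z value above about 2 is commonly read as unlikely under pure chance, but SVG was singled out because it was the highest of ten units, so the result should be treated as a prompt for further investigation rather than as proof.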

## Could we predict which employees were most likely to quit?

Let's try to find some metrics that seem to be significantly correlated with employee resignations. An easy first place to look would be at the responses in our surveys table. 

In [5]:
SELECT 
  termination_type,
  ROUND(AVG(engagement_score), 2) AS engagement_avg,
  ROUND(AVG(satisfaction_score), 2) AS satisfaction_avg,
  ROUND(AVG(work_life_balance_score), 2) AS worklife_avg,
  ROUND(AVG(engagement_score + satisfaction_score + work_life_balance_score), 2) AS total_avg
FROM employees
LEFT JOIN surveys
  USING(emp_id)
WHERE termination_type IN('Voluntary', 'Resignation', 'Unk')
GROUP BY termination_type
ORDER BY total_avg DESC;

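| termination_type | engagement_avg | satisfaction_avg | worklife_avg | total_avg |
|------------------|----------------|------------------|--------------|-----------|
| Unk              | 2.94           | 3.03             | 3.03         | 9.0       |
| Resignation      | 2.98           | 2.99             | 2.88         | 8.84      |
| Voluntary        | 2.9            | 2.95             | 2.99         | 8.84      |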


This query shows the average score employees gave on each survey metric, along with a fourth column (total_avg) that averages the sum of the three scores. Comparing these averages, the "Unk" (still employed) group has a slightly higher total average (9.0) than the two resignation groups (8.84 each). Though the direction of this difference makes sense, it is not a substantial one. Let's see if there are any relationships between training program completion and resignation rate.

(Note: the "Unk" termination type means that the employee is still actively employed with the company and has not quit.)

In [6]:
SELECT 
  termination_type,
  outcome,
  ROUND(COUNT(e.emp_id) * 100.0 / (SELECT COUNT(*) FROM employees e2 WHERE e.termination_type = e2.termination_type), 2) AS percentage_of_group
FROM employees e
INNER JOIN training
  USING(emp_id)
WHERE termination_type IN('Resignation', 'Voluntary', 'Unk')
  AND outcome = 'Passed'
GROUP BY termination_type, outcome
ORDER BY 3 DESC;

| termination_type | outcome | percentage_of_group |
|------------------|---------|---------------------|
| Unk              | Passed  | 25.84               |
| Resignation      | Passed  | 23.16               |
| Voluntary        | Passed  | 22.94               |


Here we see that a slightly higher percentage of currently employed individuals have passed a training program (about 1 in 4) compared to those who have quit (about 23%). Let's see if similar logic applies to incomplete or failed training outcomes.

In [7]:
SELECT 
  termination_type,
  outcome,
  ROUND(COUNT(e.emp_id) * 100.0 / (SELECT COUNT(*) FROM employees e2 WHERE e.termination_type = e2.termination_type), 2) AS percentage_of_group
FROM employees e
INNER JOIN training
  USING(emp_id)
WHERE termination_type IN('Resignation', 'Voluntary', 'Unk')
  AND outcome IN('Failed', 'Incomplete')
GROUP BY termination_type, outcome
ORDER BY termination_type, outcome DESC;

| termination_type | outcome    | percentage_of_group |
|------------------|------------|---------------------|
| Resignation      | Incomplete | 25.26               |
| Resignation      | Failed     | 25.26               |
| Unk              | Incomplete | 24.88               |
| Unk              | Failed     | 22.29               |
| Voluntary        | Incomplete | 28.61               |
| Voluntary        | Failed     | 26.03               |


This query shows that the lowest rates of failed or incomplete training outcomes are in the currently employed group, while the highest are in the "Voluntary" termination group. While these disparities are not large, they are still present. This might suggest that a negative training experience could be a contributing factor toward employee resignation. It could equally suggest that employees already considering quitting did not put in their full effort, leading to less than ideal outcomes.
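The comparison is easier to read once the two negative outcomes are added together per group. The short sketch below does that arithmetic on the percentages reported in the output above. It assumes each training record carries exactly one outcome, so the two percentages for a group can simply be summed.

```python
from collections import defaultdict

# (termination_type, outcome, percentage_of_group) as reported in the output above.
reported = [
    ("Resignation", "Incomplete", 25.26),
    ("Resignation", "Failed", 25.26),
    ("Unk", "Incomplete", 24.88),
    ("Unk", "Failed", 22.29),
    ("Voluntary", "Incomplete", 28.61),
    ("Voluntary", "Failed", 26.03),
]

# Sum the Failed and Incomplete shares for each termination type.
combined = defaultdict(float)
for termination_type, _outcome, percentage in reported:
    combined[termination_type] += percentage

# Rank groups from the lowest to the highest combined negative-outcome share.
ranked = sorted(combined.items(), key=lambda item: item[1])
for termination_type, percentage in ranked:
    print(f"{termination_type}: {percentage:.2f}% failed or incomplete")
```

On these figures the currently employed group sits around 47%, the "Resignation" group around 51%, and the "Voluntary" group around 55%.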

## How do training programs impact employee metrics?

While the previous queries suggest there might be a small relationship between training outcomes and employee turnover, it is also worth looking at how much of an impact these training programs have on the performance and satisfaction of our current employees.

In [8]:
SELECT
  outcome, 
  ROUND(AVG(employee_rating), 2) AS employee_rating_avg,
  ROUND(AVG(CASE WHEN performance_score = 'PIP' THEN 1
    WHEN performance_score = 'Needs Improvement' THEN 2
    WHEN performance_score = 'Fully Meets' THEN 3
    WHEN performance_score = 'Exceeds' THEN 4
    ELSE NULL END),2) AS performance_avg,
  COUNT(*) AS total_participants
FROM employees e
INNER JOIN training
  USING(emp_id)
WHERE LENGTH(exit_date) = 0
GROUP BY outcome
ORDER BY employee_rating_avg DESC;

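| outcome    | employee_rating_avg | performance_avg | total_participants |
|------------|---------------------|-----------------|--------------------|
| Completed  | 3.04                | 3.01            | 396                |
| Incomplete | 2.94                | 3.02            | 365                |
| Failed     | 2.93                | 2.98            | 327                |
| Passed     | 2.93                | 3.01            | 379                |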


This query generates the average employee rating and a numeric encoding of the performance_score column (PIP = 1 through Exceeds = 4) to see how employee performance is connected with training outcomes. Average employee rating is highest among current employees with a "Completed" training outcome (3.04, against 2.93 to 2.94 for the other outcomes), while the performance averages barely differ. This is interesting, but let's see how else training outcomes might be correlated with employee metrics.
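A detail of the numeric encoding worth spelling out: any performance_score value that the `CASE` expression does not recognize becomes `NULL`, and `AVG` skips `NULL` values instead of counting them as zero. The sketch below illustrates this with Python's standard-library `sqlite3` module and a made-up table. The five rows and the "Not Rated" label are hypothetical and not taken from the HR dataset.

```python
import sqlite3

# In-memory table with hypothetical performance labels, including one
# unrecognized label ("Not Rated") and one missing value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER, performance_score TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(1, "Exceeds"), (2, "Fully Meets"), (3, "PIP"), (4, "Not Rated"), (5, None)],
)

# Same ordinal mapping as the notebook query; unmatched labels fall through to NULL.
query = """
SELECT
  ROUND(AVG(score), 2) AS performance_avg,
  COUNT(*) AS total_rows,
  COUNT(score) AS scored_rows
FROM (
  SELECT CASE
      WHEN performance_score = 'PIP' THEN 1
      WHEN performance_score = 'Needs Improvement' THEN 2
      WHEN performance_score = 'Fully Meets' THEN 3
      WHEN performance_score = 'Exceeds' THEN 4
      ELSE NULL
    END AS score
  FROM employees
) AS encoded;
"""
performance_avg, total_rows, scored_rows = conn.execute(query).fetchone()
print(performance_avg, total_rows, scored_rows)
```

Here the average is taken over the three scored rows only, so comparing `COUNT(*)` with `COUNT(score)` is a cheap way to check how many rows an average like performance_avg actually rests on.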

In [9]:
SELECT
  outcome, 
  ROUND(AVG(engagement_score), 2) AS engagement_avg,
  ROUND(AVG(satisfaction_score), 2) AS satisfaction_avg,
  ROUND(AVG(work_life_balance_score), 2) AS worklife_avg,
  ROUND(AVG(engagement_score + satisfaction_score + work_life_balance_score), 2) AS total_avg,
  COUNT(*) AS total_participants
FROM training 
INNER JOIN surveys
  USING(emp_id)
GROUP BY outcome
ORDER BY total_avg DESC;

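| outcome    | engagement_avg | satisfaction_avg | worklife_avg | total_avg | total_participants |
|------------|----------------|------------------|--------------|-----------|--------------------|
| Completed  | 2.96           | 3.08             | 2.97         | 9.0       | 770                |
| Incomplete | 2.95           | 2.99             | 3.05         | 8.99      | 775                |
| Failed     | 2.95           | 3.0              | 2.97         | 8.92      | 716                |
| Passed     | 2.91           | 3.02             | 2.96         | 8.88      | 739                |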


Here we look at employee survey responses to see if training programs might be having any impact on employee satisfaction. Note that, unlike the previous query, this one does not filter on exit_date, so it includes former employees as well. The data does not show any strong correlations one way or the other. There is perhaps a very slight increase in average satisfaction for employees with a "Completed" training (3.08), but this could just be random variability. So far we've been looking at training outcomes as a whole. What if we look at differences between specific training programs?

In [10]:
SELECT
  program_name,
  ROUND(AVG(employee_rating), 2) AS employee_rating_avg,
  ROUND(AVG(CASE WHEN performance_score = 'PIP' THEN 1
    WHEN performance_score = 'Needs Improvement' THEN 2
    WHEN performance_score = 'Fully Meets' THEN 3
    WHEN performance_score = 'Exceeds' THEN 4
    ELSE NULL END),2) AS performance_avg,  
  ROUND(AVG(engagement_score + satisfaction_score + work_life_balance_score), 2) AS survey_avg,
  COUNT(*) AS total_participants
FROM employees e 
INNER JOIN training t
  USING(emp_id)
INNER JOIN surveys s
  USING(emp_id)
GROUP BY program_name
ORDER BY employee_rating_avg DESC;

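| program_name           | employee_rating_avg | performance_avg | survey_avg | total_participants |
|------------------------|---------------------|-----------------|------------|--------------------|
| Technical Skills       | 2.99                | 2.99            | 9.07       | 579                |
| Communication Skills   | 2.98                | 2.99            | 8.78       | 673                |
| Customer Service       | 2.97                | 3.03            | 8.93       | 565                |
| Leadership Development | 2.95                | 2.98            | 9.04       | 574                |
| Project Management     | 2.95                | 3.02            | 8.96       | 609                |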


Looking at the performance metrics tied to each training program, the only result that stands out is the fairly low survey average (8.78) for employees who took part in the "Communication Skills" program. Perhaps this program could use some revamping following employee feedback. While there are some interesting findings, the results so far don't show great returns from our training programs. Let's take a look at the costs of our programs and see if they seem to be worth the investment.

In [11]:
WITH cost_bounds AS (
    SELECT 
        percentile_cont(0.33) WITHIN GROUP (ORDER BY cost) AS low_threshold,
        percentile_cont(0.66) WITHIN GROUP (ORDER BY cost) AS mid_threshold
    FROM training
),
cost_groups AS (
    SELECT 
        e.emp_id,
        t.cost,
        CASE 
            WHEN t.cost <= cb.low_threshold THEN 'Low Cost'
            WHEN t.cost <= cb.mid_threshold THEN 'Mid Cost'
            ELSE 'High Cost'
        END AS cost_group,
        e.employee_rating,
        CASE 
            WHEN e.performance_score = 'PIP' THEN 1
            WHEN e.performance_score = 'Needs Improvement' THEN 2
            WHEN e.performance_score = 'Fully Meets' THEN 3
            WHEN e.performance_score = 'Exceeds' THEN 4
            ELSE NULL 
        END AS performance_score_numeric,
        s.engagement_score,
        s.satisfaction_score,
        s.work_life_balance_score
    FROM employees e
    INNER JOIN training t USING(emp_id)  -- Using training table for cost
    INNER JOIN surveys s USING(emp_id)
    CROSS JOIN cost_bounds cb
)
SELECT 
    cost_group,
    ROUND(AVG(employee_rating), 2) AS employee_rating_avg,
    ROUND(AVG(performance_score_numeric), 2) AS performance_avg,  
    ROUND(AVG(engagement_score + satisfaction_score + work_life_balance_score), 2) AS survey_avg,
    COUNT(*) AS total_participants
FROM cost_groups
GROUP BY cost_group
ORDER BY performance_avg DESC;

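| cost_group | employee_rating_avg | performance_avg | survey_avg | total_participants |
|------------|---------------------|-----------------|------------|--------------------|
| Low Cost   | 2.95                | 3.02            | 8.94       | 990                |
| High Cost  | 2.97                | 3.0             | 9.0        | 1020               |
| Mid Cost   | 2.98                | 2.99            | 8.91       | 990                |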


This query splits our training records into Low Cost, Mid Cost, and High Cost groups (cut at the 33rd and 66th percentiles of cost) to see if the extra money we are spending leads to more positive outcomes. The data does not look promising: the high cost group shows only a very slight increase in survey average, and there is no discernible improvement in employee performance metrics. Perhaps we need to rethink our training programs before we commit more resources to them.
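For readers unfamiliar with `percentile_cont`, the bucketing logic can be sketched in plain Python. The function below reimplements the linear interpolation that `percentile_cont` is understood to perform (position `fraction * (n - 1)` in the sorted values). The function names and the seven sample costs are hypothetical and chosen only to make the cut points easy to follow.

```python
def percentile_cont(values, fraction):
    """Linearly interpolated percentile over the sorted values (0 <= fraction <= 1)."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])

def cost_group(cost, low_threshold, mid_threshold):
    """Mirror of the CASE expression: first matching branch wins."""
    if cost <= low_threshold:
        return "Low Cost"
    if cost <= mid_threshold:
        return "Mid Cost"
    return "High Cost"

# Hypothetical training costs, not taken from the training table.
costs = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0]

low_threshold = percentile_cont(costs, 0.33)
mid_threshold = percentile_cont(costs, 0.66)
groups = [cost_group(cost, low_threshold, mid_threshold) for cost in costs]

for cost, group in zip(costs, groups):
    print(cost, group)
```

Because the cut points are 0.33 and 0.66 instead of exact thirds, the High Cost bucket ends up slightly larger than the other two, which matches the 990 / 990 / 1020 split in the query output.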