## Final Project - Justin McElderry - Predict Company Turnover

#### Section 1: Problem Statement & Hypothesis

- Problem Statement: I want to predict whether or not someone will leave a company based on different features.

- Hypothesis: I think that if employees have not had a promotion in the past 5 years, they are more likely to leave the company.

In machine learning terms, this is a classification problem: the output of my project will be a binary value, "left" vs. "did not leave". While classification is powerful, there is some risk in making a binary decision based on data that may be subjective. I'll use the data dictionary below to describe the columns.

As stated in my hypothesis, I believe that whether an employee has received a promotion in the past 5 years is the biggest indicator of whether or not they will leave the company.
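
One way to get a first read on this hypothesis is to compare the share of leavers among promoted and non-promoted employees. The sketch below is a minimal illustration using only the standard library; the `records` list is hypothetical sample data (not the real dataset), and the helper name `leave_rate_by_promotion` is an assumption introduced for this example. The column names match the data dictionary in Section 2.

```python
from collections import defaultdict

# Hypothetical sample records; the real dataset is not reproduced here.
# Both columns are 0/1 flags, as described in the data dictionary.
records = [
    {"promotion_last_5years": 0, "left": 1},
    {"promotion_last_5years": 0, "left": 1},
    {"promotion_last_5years": 0, "left": 0},
    {"promotion_last_5years": 0, "left": 0},
    {"promotion_last_5years": 1, "left": 0},
    {"promotion_last_5years": 1, "left": 0},
    {"promotion_last_5years": 1, "left": 1},
    {"promotion_last_5years": 1, "left": 0},
]


def leave_rate_by_promotion(rows):
    """Return {promotion flag: share of employees who left} for each group."""
    totals = defaultdict(int)
    leavers = defaultdict(int)
    for row in rows:
        group = row["promotion_last_5years"]
        totals[group] += 1
        leavers[group] += row["left"]
    return {group: leavers[group] / totals[group] for group in totals}


rates = leave_rate_by_promotion(records)
print("Leave rate without promotion:", rates[0])
print("Leave rate with promotion:", rates[1])
```

A higher leave rate in the non-promoted group would be consistent with the hypothesis, though it would not by itself show that promotion is the biggest indicator.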

#### Section 2: Describing the Dataset

Below is the dictionary of the data set I'm using.

Size of Dataset: 
- 15000 rows
- 10 columns
- 567KB

Features:
- satisfaction_level: 
    - employee satisfaction level measured on a scale from 0 to 1
    - data type: continuous
    
- last_evaluation: 
    - the last time an employee was evaluated by the employer for performance (measured in years)
    - data type: continuous
    
- number_project: 
    - the number of projects an employee is working on during a given time frame
    - data type: continuous
    
- time_spend_company: 
    - the number of years an employee has been at the company
    - data type: continuous
    
- Work_accident: 
    - whether or not they have had a work 'accident' (something that injured them) throughout their tenure
    - data type: binary 
    
- promotion_last_5years: 
    - whether or not the employee has received a promotion in the last 5 years at the company
    - data type: binary

- sales: 
    - the department in which the employee works
    - data type: categorical 

- salary: 
    - categories of salary ranks based on predetermined classification logic 
    - data type: categorical 

- average_montly_hours: 
    - average hours of employee working time per month
    - data type: continuous


Output: 
- left: 
    - Whether or not the employee has left
    - data type: binary 
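
The data dictionary above can also be written down as a small lookup table and used to sanity-check records before modeling. This is only a sketch: the column names and type labels come from the list above, while the validation rules, the `validate_row` helper, and the sample rows are assumptions made for illustration.

```python
# Column name -> data type label, copied from the data dictionary above.
DATA_DICTIONARY = {
    "satisfaction_level": "continuous",
    "last_evaluation": "continuous",
    "number_project": "continuous",
    "time_spend_company": "continuous",
    "Work_accident": "binary",
    "promotion_last_5years": "binary",
    "sales": "categorical",
    "salary": "categorical",
    "average_montly_hours": "continuous",
    "left": "binary",
}


def validate_row(row):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for column, kind in DATA_DICTIONARY.items():
        if column not in row:
            problems.append(f"missing column: {column}")
            continue
        value = row[column]
        if kind == "binary":
            if value not in (0, 1):
                problems.append(f"{column} should be 0 or 1, got {value!r}")
        elif kind == "continuous":
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{column} should be numeric, got {value!r}")
        elif kind == "categorical":
            if not isinstance(value, str):
                problems.append(f"{column} should be text, got {value!r}")
    return problems


# Hypothetical rows, not taken from the real dataset.
good_row = {
    "satisfaction_level": 0.4, "last_evaluation": 0.5, "number_project": 3,
    "time_spend_company": 4, "Work_accident": 0, "promotion_last_5years": 0,
    "sales": "support", "salary": "low", "average_montly_hours": 160, "left": 1,
}
bad_row = dict(good_row, Work_accident=2)

print(validate_row(good_row))
print(validate_row(bad_row))
```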

#### Section 3: Domain Knowledge 

- I've spent the last year in a product management role for a B2B SaaS HR Analytics product. Our product provides Fortune 500 companies with a historical view of how they are allocating their labor spend. Ultimately, the goal of the product is to provide companies with suggested remedies for how they can save money and optimize their labor-related processes. 

- Because of my role on this product, I've gained a great deal of subject matter expertise in the field of HR. As you'll see below, I recognize that there is a laundry list of missing factors which could significantly impact the outcome of my prediction. 

- I've spoken to several people at work about previous efforts to answer this question. However, the struggle has always been the flexibility of the model. In other words, because the model was always built on public employee data for large services firms, people kept getting predictable results. For example, at consulting firms, people who consistently travel to more than 3 different cities per week are most likely to leave, which is something that could have been easily determined without a model. 

#### Section 4: Project Concerns

Main Concern: 

As shown below, I'm well aware that there are some missing predictors that could make the model much more robust. However, I'm not sure how to go about getting data to match each of the features listed below. 

Missing Data: 
- Whether an employee is full-time or part-time
- The shifts an employee is scheduled for -- i.e. the number of times an employee had to work an odd shift
- Number of times an employee had to travel in a year
- Size of household 
- Per capita income of the employee 
- Number of years of marriage of an employee
- Age of employee 
- Number of times an employee had to remain on-call in a year
- Average number of overtime hours worked by an employee per month
- Highest degree obtained by an employee
- Actual salary data instead of categorical data
    

Key Assumption: 
- There seems to be a time element missing from the provided dataset. Ideally, I'd like the model to be able to predict outcomes based on multiple years of data and even specific months of data (to see the effects of seasonality). 

Benefit to Society: 

My goal with this analysis is to develop a more flexible model to be used by startups during their Series A/Series B periods. One problem I've observed is that many startups have no trouble acquiring top talent once they get funding, but few of them understand what it takes to keep that talent when the company starts going through rough times. Look at Yik Yak, for example. 

Risk of the Model: 

If my model is wrong and it is given to a startup, there's a big risk that a company will make extra efforts (i.e. spend money) to please and retain employees who aren't at risk of leaving. On the other hand, my model could cause employers to neglect employees who are at risk of leaving.
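
These two kinds of error correspond to false positives (employees flagged as leaving who stay) and false negatives (employees who leave without being flagged). A minimal sketch of how they could be counted separately is shown here; the `actual` and `predicted` lists are hypothetical labels, with 1 meaning "left" and 0 meaning "did not leave", and `confusion_counts` is a helper name introduced for this example.

```python
def confusion_counts(actual, predicted):
    """Count each combination of true label and predicted label (1 = left)."""
    counts = {
        "true_positive": 0,   # predicted to leave, and left
        "false_positive": 0,  # predicted to leave, but stayed (wasted retention spend)
        "false_negative": 0,  # predicted to stay, but left (neglected employee)
        "true_negative": 0,   # predicted to stay, and stayed
    }
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["true_positive"] += 1
        elif p == 1 and a == 0:
            counts["false_positive"] += 1
        elif p == 0 and a == 1:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts


# Hypothetical labels for eight employees.
actual = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)
```

Tracking the two error types separately matters because their costs differ: a false positive costs retention spend, while a false negative costs an employee.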

#### Section 5: Outcome

Expectations: 

I took a minute to do a quick count of employees who had received a promotion in the past 5 years. That number was a minuscule 319 out of 15,000 total employees. Some of the remaining employees have no promotion on record simply because they have not yet been at the company for 5 years. Even so, I still expect that most employees in this data set will leave. 
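
A count like this, together with a check on how many non-promoted employees have less than 5 years of tenure, could be sketched as follows. The `employees` list is hypothetical sample data rather than the real dataset, so the numbers it produces are not the 319 quoted above; the column names come from the data dictionary in Section 2.

```python
from collections import Counter

# Hypothetical sample records; time_spend_company is tenure in years.
employees = [
    {"time_spend_company": 2, "promotion_last_5years": 0},
    {"time_spend_company": 3, "promotion_last_5years": 0},
    {"time_spend_company": 6, "promotion_last_5years": 0},
    {"time_spend_company": 7, "promotion_last_5years": 1},
    {"time_spend_company": 5, "promotion_last_5years": 0},
    {"time_spend_company": 4, "promotion_last_5years": 1},
]

# How many employees fall into each promotion group (1 = promoted).
promotion_counts = Counter(e["promotion_last_5years"] for e in employees)

# Non-promoted employees who have been at the company for less than 5 years.
short_tenure_unpromoted = sum(
    1
    for e in employees
    if e["promotion_last_5years"] == 0 and e["time_spend_company"] < 5
)

print("Promoted:", promotion_counts[1])
print("Not promoted:", promotion_counts[0])
print("Not promoted, under 5 years of tenure:", short_tenure_unpromoted)
```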

Target Audience: 

Because the target audience will be leaders at startups, they will likely not expect their employees to leave because there is a rising trend of treating employees exceptionally in order to retain top talent. 

Model Complexity: 

If I'm able to get access to more data and fill in the gaps listed above, then I expect my model to get pretty complex. 

Definition of Success: Building a model that is able to predict employee retention with a high accuracy score. 
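
Whether an accuracy score counts as high is easier to judge against a simple reference point. The sketch below computes accuracy and compares it with a baseline that always predicts the most common label. The labels are hypothetical, and the helper names `accuracy` and `majority_baseline` are assumptions introduced for this example.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Share of predictions that match the true labels."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


def majority_baseline(actual):
    """Predict the most common label for every employee."""
    most_common_label, _ = Counter(actual).most_common(1)[0]
    return [most_common_label] * len(actual)


# Hypothetical labels for ten employees (1 = left, 0 = did not leave).
actual = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
predicted = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

model_accuracy = accuracy(actual, predicted)
baseline_accuracy = accuracy(actual, majority_baseline(actual))

print("Model accuracy:", model_accuracy)
print("Baseline accuracy:", baseline_accuracy)
```

In this made-up example the model and the baseline score the same, which shows why an accuracy number is best read alongside a baseline when one outcome is much more common than the other.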