# Individual Assignment 1: Data Description, EDA, and Visualization 

Jay Tan | 480909583

## Data Description

This project uses the [employee dataset](https://www.kaggle.com/datasets/tawfikelmetwally/employee-dataset), which is available on Kaggle. The data was provided by the HR department of an undisclosed company and contains anonymized information on 4653 employees. There are 9 variables:

- `Education`: the education level of the employee. This is a categorical variable with 3 levels: Bachelors, Masters, and PHD.
- `JoiningYear`: the year that the employee joined the company.
- `City`: the city that the employee is based in. This is a categorical variable with 3 levels: Bangalore, Pune, and New Delhi.
- `PaymentTier`: the salary tier the employee is in. This is a categorical variable with 3 levels: 1, 2, and 3. 
- `Age`: the age of the employee.
- `Gender`: the gender of the employee. This is a categorical variable with 2 levels: Male and Female.
- `EverBenched`: whether the employee has ever been temporarily left without assigned work ("benched"). This is a categorical variable with 2 levels: Yes and No.
- `ExperienceInCurrentDomain`: the employee's years of experience in their current field of work.
- `LeaveOrNot`: whether the employee left the company. This is a categorical variable with 2 levels: 1 (employee left) and 0 (employee did not leave).

## Question

In this project, I want to determine whether it is possible to accurately predict (classify) whether an employee will leave the company, using some combination of their education, joining year, location, pay tier, age, gender, benched status, and job experience. These factors may affect an employee's desire to stay with a company, which makes them potentially good predictors. For example, low pay might drive employees to move to another job, and being benched may cause an employee to seek other opportunities where their skills will be used. The question is primarily focused on prediction rather than inference, since the interest is in how accurately the model classifies employees.
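
To make "accurately classify" concrete, here is a minimal sketch of how accuracy and a confusion matrix could be computed once a model produces predicted labels. The `actual` and `predicted` vectors are hypothetical values made up for illustration; they are not results from this dataset, and accuracy is only one of several possible metrics.

```r
# Hypothetical labels for eight employees (1 = left, 0 = stayed).
# These are illustrative values only, not model output.
actual    <- factor(c(0, 1, 0, 1, 1, 0, 0, 1), levels = c(0, 1))
predicted <- factor(c(0, 1, 0, 0, 1, 0, 1, 1), levels = c(0, 1))

# Cross-tabulate predictions against the true labels.
confusion <- table(Predicted = predicted, Actual = actual)
confusion

# Accuracy: the proportion of employees classified correctly.
accuracy <- mean(predicted == actual)
accuracy
```

In the project itself, `actual` would be the held-out `LeaveOrNot` values and `predicted` would come from whichever classifier is fitted.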

## EDA

In [2]:
# Run this cell before continuing.
library(tidyverse)
library(repr)
library(infer)
library(gridExtra)
library(caret)
library(pROC)
library(boot)
library(glmnet)


In [13]:
employee_dat <- read_csv("https://raw.githubusercontent.com/jtan29/stat-301-project/main/Employee.csv")
head(employee_dat)

Rows: 4653 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Education, City, Gender, EverBenched
dbl (5): JoiningYear, PaymentTier, Age, ExperienceInCurrentDomain, LeaveOrNot

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


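| Education | JoiningYear | City | PaymentTier | Age | Gender | EverBenched | ExperienceInCurrentDomain | LeaveOrNot |
|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| Bachelors | 2017 | Bangalore | 3 | 34 | Male | No | 0 | 0 |
| Bachelors | 2013 | Pune | 1 | 28 | Female | No | 3 | 1 |
| Bachelors | 2014 | New Delhi | 3 | 38 | Female | No | 2 | 0 |
| Masters | 2016 | Bangalore | 3 | 27 | Male | No | 5 | 1 |
| Masters | 2017 | Pune | 3 | 24 | Male | Yes | 2 | 1 |
| Bachelors | 2016 | Bangalore | 3 | 22 | Male | No | 0 | 0 |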


First, I'll convert the categorical variables to factors.

In [38]:
# Convert the categorical columns to factors in a single mutate() call.
employee_dat <- employee_dat %>%
  mutate(
    Education = as.factor(Education),
    PaymentTier = as.factor(PaymentTier),
    Gender = as.factor(Gender),
    EverBenched = as.factor(EverBenched),
    ExperienceInCurrentDomain = as.factor(ExperienceInCurrentDomain),
    LeaveOrNot = as.factor(LeaveOrNot)
  )

head(employee_dat)

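| Education | JoiningYear | City | PaymentTier | Age | Gender | EverBenched | ExperienceInCurrentDomain | LeaveOrNot |
|---|---|---|---|---|---|---|---|---|
| `<fct>` | `<dbl>` | `<chr>` | `<fct>` | `<dbl>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` |
| Bachelors | 2017 | Bangalore | 3 | 34 | Male | No | 0 | 0 |
| Bachelors | 2013 | Pune | 1 | 28 | Female | No | 3 | 1 |
| Bachelors | 2014 | New Delhi | 3 | 38 | Female | No | 2 | 0 |
| Masters | 2016 | Bangalore | 3 | 27 | Male | No | 5 | 1 |
| Masters | 2017 | Pune | 3 | 24 | Male | Yes | 2 | 1 |
| Bachelors | 2016 | Bangalore | 3 | 22 | Male | No | 0 | 0 |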
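
Note that `City` is described as categorical in the data description but is still `<chr>` in the output above. A minimal sketch of converting every categorical column, including `City`, in one step with `across()` might look like the following. To keep it self-contained it uses only the six rows printed by `head()`, and the names `employee_head` and `categorical_cols` are hypothetical helpers introduced here.

```r
library(tidyverse)

# Toy data: the six rows shown by head() above, not the full dataset.
employee_head <- tribble(
  ~Education,  ~JoiningYear, ~City,       ~PaymentTier, ~Age, ~Gender,  ~EverBenched, ~ExperienceInCurrentDomain, ~LeaveOrNot,
  "Bachelors", 2017,         "Bangalore", 3,            34,   "Male",   "No",         0,                          0,
  "Bachelors", 2013,         "Pune",      1,            28,   "Female", "No",         3,                          1,
  "Bachelors", 2014,         "New Delhi", 3,            38,   "Female", "No",         2,                          0,
  "Masters",   2016,         "Bangalore", 3,            27,   "Male",   "No",         5,                          1,
  "Masters",   2017,         "Pune",      3,            24,   "Male",   "Yes",        2,                          1,
  "Bachelors", 2016,         "Bangalore", 3,            22,   "Male",   "No",         0,                          0
)

# Hypothetical helper: the columns converted in the cell above, plus City.
categorical_cols <- c(
  "Education", "City", "PaymentTier", "Gender",
  "EverBenched", "ExperienceInCurrentDomain", "LeaveOrNot"
)

# Apply as.factor() to each listed column; other columns are left unchanged.
employee_head_fct <- employee_head %>%
  mutate(across(all_of(categorical_cols), as.factor))

glimpse(employee_head_fct)
```

The same `mutate(across(...))` call should apply unchanged to `employee_dat`. Listing the columns in one vector makes it harder to leave one out by accident.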


Next, I'll check if we're missing any data:

In [39]:
sum(is.na(employee_dat))
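
`sum(is.na(...))` returns a single total for the whole data frame. If that total were nonzero, a per-column breakdown would show where the gaps are. Here is a minimal sketch on a hypothetical toy tibble, `toy_dat`, with one `NA` inserted on purpose; it is not the real dataset.

```r
library(tidyverse)

# Hypothetical toy data with one missing Age value inserted deliberately.
toy_dat <- tibble(
  Age = c(34, 28, NA, 27),
  City = c("Bangalore", "Pune", "New Delhi", "Bangalore"),
  LeaveOrNot = c(0, 1, 0, 1)
)

# Count the missing values in each column separately.
na_counts <- toy_dat %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Replacing `toy_dat` with `employee_dat` would give the same per-column view for the full data.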

With that done, I'll generate some summaries and visualizations of the data to better understand the relationships between variables before developing my model.

In [7]:
employee_dat %>%
  group_by(Education) %>%
  summarize(n())

| Education | n() |
|---|---|
| `<chr>` | `<int>` |
| Bachelors | 3601 |
| Masters | 873 |
| PHD | 179 |
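
The raw counts are easier to compare as proportions. A minimal sketch, assuming the counts printed above are re-entered by hand into a hypothetical `education_counts` tibble so the example runs on its own:

```r
library(tidyverse)

# Counts copied from the summary output above.
education_counts <- tibble(
  Education = c("Bachelors", "Masters", "PHD"),
  n = c(3601, 873, 179)
)

# Add each level's share of all employees.
education_props <- education_counts %>%
  mutate(prop = n / sum(n))

education_props
```

On the full data, `employee_dat %>% count(Education) %>% mutate(prop = n / sum(n))` should give the same table without retyping the counts.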
