### Homework 2: Data Analysis for Recruitment Data
Owen Randolph, 10/26/2024, DSCI-590: Applied Data Science

Data Description: The dataset contains HR data from a medical biotechnology company. It was authored on 9/2/2016. The author, Ben Teusch, is a DataCamp instructor and People Analytics Partner at Facebook. The dataset was found at https://github.com/teuschb/hr_data/blob/master/datasets/recruitment_evaluation_data.csv.

In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Load and Prepare Data

In [53]:
# Read in the recruitment data
data = pd.read_csv("recruitment_evaluation_data.csv")

In [54]:
data.head()

Unnamed: 0.1,Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,...,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,SalesRating,HireSource,Campus
0,1,41,1,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,...,8,0,1,6,4,0,5,1.08819,Applied Online,
1,2,49,0,Travel_Frequently,279,Research & Development,8,1,Life Sciences,2,...,10,3,3,10,7,1,7,,,
2,3,37,1,Travel_Rarely,1373,Research & Development,2,2,Other,4,...,7,3,3,0,0,0,0,,Campus,Tech
3,4,33,0,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,5,...,8,3,3,8,7,3,0,,,
4,5,27,0,Travel_Rarely,591,Research & Development,2,1,Medical,7,...,6,3,3,2,2,2,2,,Referral,


In [55]:
# Check the number of rows and columns of the dataset
data.shape

(1470, 36)

In [56]:
# Check the data types in columns
data.dtypes

Unnamed: 0                    int64
Age                           int64
Attrition                     int64
BusinessTravel               object
DailyRate                     int64
Department                   object
DistanceFromHome              int64
Education                     int64
EducationField               object
EmployeeNumber                int64
EnvironmentSatisfaction       int64
Gender                       object
HourlyRate                    int64
JobInvolvement                int64
JobLevel                      int64
JobRole                      object
JobSatisfaction               int64
MonthlyIncome                 int64
MonthlyRate                   int64
NumCompaniesWorked            int64
OverTime                     object
PercentSalaryHike             int64
PerformanceRating             int64
RelationshipSatisfaction      int64
StandardHours                 int64
StockOptionLevel              int64
TotalWorkingYears             int64
TrainingTimesLastYear       

In [57]:
# Display summary statistics
data.describe()

Unnamed: 0.1,Unnamed: 0,Age,Attrition,DailyRate,DistanceFromHome,Education,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,SalesRating
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,...,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,446.0
mean,735.5,36.92381,0.161224,802.485714,9.192517,2.912925,1024.865306,2.721769,65.891156,2.729932,...,80.0,0.793878,11.279592,2.79932,2.761224,7.008163,4.229252,2.187755,4.123129,1.082607
std,424.496761,9.135373,0.367863,403.5091,8.106864,1.024165,602.024335,1.093082,20.329428,0.711561,...,0.0,0.852077,7.780782,1.289271,0.706476,6.126525,3.623137,3.22243,3.568136,0.710042
min,1.0,18.0,0.0,102.0,1.0,1.0,1.0,1.0,30.0,1.0,...,80.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,-0.710827
25%,368.25,30.0,0.0,465.0,2.0,2.0,491.25,2.0,48.0,2.0,...,80.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0,0.584444
50%,735.5,36.0,0.0,802.0,7.0,3.0,1020.5,3.0,66.0,3.0,...,80.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0,1.070143
75%,1102.75,43.0,0.0,1157.0,14.0,4.0,1555.75,4.0,83.75,3.0,...,80.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0,1.532488
max,1470.0,60.0,1.0,1499.0,29.0,5.0,2068.0,4.0,100.0,4.0,...,80.0,3.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0,3.66674


For clarity, and because the ReadMe file does not specify units, I will assume for the purposes of this assignment that DailyRate and HourlyRate are in USD and DistanceFromHome is in miles. The Education feature is not clearly defined and will be dropped, along with "Unnamed: 0".
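
The column cleanup described above can be sketched as follows. This is only an illustration on a small made-up frame: the `toy` DataFrame and its values are hypothetical (not the recruitment data), and removing both columns in a single `drop` call is an assumption about how the step might be written.

```python
import pandas as pd

# Hypothetical stand-in for the recruitment data (values are made up)
toy = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "Age": [41, 49, 37],
    "Education": [2, 1, 2],
    "DailyRate": [1102, 279, 1373],
})

# Drop the row-counter column and the ambiguous Education feature.
# errors="ignore" lets the cell be re-run without raising a KeyError
# if the columns are already gone.
cleaned = toy.drop(columns=["Unnamed: 0", "Education"], errors="ignore")
print(cleaned.columns.tolist())
```

Using `errors="ignore"` is a convenience for notebooks, where cells are often executed more than once.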

In [58]:
data.columns = data.columns.str.strip()
data.head()

Unnamed: 0.1,Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,...,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,SalesRating,HireSource,Campus
0,1,41,1,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,...,8,0,1,6,4,0,5,1.08819,Applied Online,
1,2,49,0,Travel_Frequently,279,Research & Development,8,1,Life Sciences,2,...,10,3,3,10,7,1,7,,,
2,3,37,1,Travel_Rarely,1373,Research & Development,2,2,Other,4,...,7,3,3,0,0,0,0,,Campus,Tech
3,4,33,0,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,5,...,8,3,3,8,7,3,0,,,
4,5,27,0,Travel_Rarely,591,Research & Development,2,1,Medical,7,...,6,3,3,2,2,2,2,,Referral,


In [59]:
# "Unnamed: 0" column will be dropped
data = data.drop(columns=['Unnamed: 0'], errors='ignore')

In [60]:
data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,...,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,SalesRating,HireSource,Campus
0,41,1,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,2,...,8,0,1,6,4,0,5,1.08819,Applied Online,
1,49,0,Travel_Frequently,279,Research & Development,8,1,Life Sciences,2,3,...,10,3,3,10,7,1,7,,,
2,37,1,Travel_Rarely,1373,Research & Development,2,2,Other,4,4,...,7,3,3,0,0,0,0,,Campus,Tech
3,33,0,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,5,4,...,8,3,3,8,7,3,0,,,
4,27,0,Travel_Rarely,591,Research & Development,2,1,Medical,7,1,...,6,3,3,2,2,2,2,,Referral,


In [61]:
# Check for missing values
data.isnull().sum()

Age                            0
Attrition                      0
BusinessTravel                 0
DailyRate                      0
Department                     0
DistanceFromHome               0
Education                      0
EducationField                 0
EmployeeNumber                 0
EnvironmentSatisfaction        0
Gender                         0
HourlyRate                     0
JobInvolvement                 0
JobLevel                       0
JobRole                        0
JobSatisfaction                0
MonthlyIncome                  0
MonthlyRate                    0
NumCompaniesWorked             0
OverTime                       0
PercentSalaryHike              0
PerformanceRating              0
RelationshipSatisfaction       0
StandardHours                  0
StockOptionLevel               0
TotalWorkingYears              0
TrainingTimesLastYear          0
WorkLifeBalance                0
YearsAtCompany                 0
YearsInCurrentRole             0
YearsSince

Upon inspection, SalesRating is recorded only for salespeople, so its missingness is systematic: it depends on the employee's role. The Campus column is populated only for certain types of schools and is missing for employees with unspecified campuses or possibly no college. HireSource contains a substantial amount of missing data, likely Missing Completely at Random (MCAR), as the missingness does not appear to be related to the data itself.
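
One way to probe these missingness claims is to compare the share of missing values across groups. The sketch below uses a small made-up frame (the `toy` data is hypothetical; only the column names come from the dataset). If SalesRating is missing for every non-Sales row and present for every Sales row, the gap is structural. A similar missing share of HireSource across departments would be consistent with, though not proof of, MCAR.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the pattern described above
toy = pd.DataFrame({
    "Department": ["Sales", "Sales",
                   "Research & Development", "Research & Development"],
    "SalesRating": [1.09, 0.58, np.nan, np.nan],
    "HireSource": ["Applied Online", np.nan, "Campus", np.nan],
})

# Fraction of missing values per department for each column
missing_share = (
    toy[["SalesRating", "HireSource"]]
    .isna()
    .groupby(toy["Department"])
    .mean()
)
print(missing_share)
```

On the real data, the same pattern could be applied with `data` in place of `toy`, grouping by Department or another feature of interest.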

### Exploratory Data Analysis

In [1]:
import os
print(os.getcwd())

C:\Users\orand



