# The Problem
I wanted some practice answering questions based on human resources (HR) data.

# The Data
I like to see what Kaggle has for datasets because I can usually find something close to the data I want to practice on.  I chose the [Human Resources Data Set](https://www.kaggle.com/datasets/rhuebner/human-resources-data-set/data) by Dr. Carla Patalano and Dr. Rich Huebner. 

Dr. Richard A. Huebner, and Dr. Carla Patalano. (2020). Human Resources Data Set [Data set]. Kaggle. [https://doi.org/10.34740/KAGGLE/DSV/1572001](https://doi.org/10.34740/KAGGLE/DSV/1572001)

# The Approach
Until I have worked in an environment long enough to become familiar with the types of questions typically asked about an organization's data, I like to ask ChatGPT for sample questions that a data professional might be asked about the dataset I am using. In this instance, ChatGPT came up with a lot of questions in seven categories:
- Employee Retention and Turnover
- Performance and Promotions
- Diversity and Inclusion
- Compensation and Benefits
- Workplace Dynamics
- Predictive and Prescriptive Analytics
- Strategic Workforce Planning

For this project, I took one question from each category:
1. Are there specific departments or job roles with higher turnover rates?
2. Do employees who receive regular training or certifications perform better or stay longer?
3. Are there disparities in promotion rates across different gender or age groups?
4. Are there trends in benefits utilization that correlate with retention or performance?
5. Do employees with higher engagement scores tend to stay longer or perform better?
6. Can we build a model to predict which employees are most at risk of leaving within the next 6-12 months?
7. Are there gaps in skills or experience across the workforce that need to be addressed?
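
As a rough illustration of how question 1 might be approached, the sketch below groups employees by a column and computes the share flagged as terminated. It assumes that `Termd` is 1 for employees who have left and 0 otherwise, and it runs on a tiny hand-built sample rather than the full file. The helper name `turnover_rate` is hypothetical.

```python
import pandas as pd

# Tiny illustrative sample using three of the dataset's columns.
# Assumption: Termd == 1 marks an employee who has left the company.
sample = pd.DataFrame({
    "Department": ["Production", "IT/IS", "Production", "Production", "Production"],
    "Position": [
        "Production Technician I",
        "Sr. DBA",
        "Production Technician II",
        "Production Technician I",
        "Production Technician I",
    ],
    "Termd": [0, 1, 1, 0, 1],
})


def turnover_rate(df, by):
    """Headcount, terminations, and turnover rate per group, highest rate first."""
    summary = df.groupby(by)["Termd"].agg(headcount="count", terminated="sum")
    summary["turnover_rate"] = summary["terminated"] / summary["headcount"]
    return summary.sort_values("turnover_rate", ascending=False)


print(turnover_rate(sample, "Department"))
print(turnover_rate(sample, "Position"))
```

Keeping `headcount` next to the rate matters with a dataset of only a few hundred rows, since a small department can show an extreme rate from one or two departures.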


I told ChatGPT that I wanted to do a practice data science project.

In [27]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rhuebner/human-resources-data-set")
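
`kagglehub.dataset_download` returns the local directory that holds the downloaded files, so the CSV could be read from there instead of from a hard-coded location. The sketch below is one way to locate it. The helper name `find_csv` is hypothetical, and searching subfolders is an assumption, since the layout of the download directory is not shown here.

```python
from pathlib import Path


def find_csv(download_dir, name="HRDataset_v14.csv"):
    """Return the first file called `name` anywhere under `download_dir`, or None."""
    matches = sorted(Path(download_dir).rglob(name))
    return matches[0] if matches else None
```

With the `path` from the cell above, usage would look like `csv_path = find_csv(path)`, followed by `pd.read_csv(csv_path)` once `csv_path` is confirmed not to be `None`.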

In [39]:
import pandas as pd
pd.set_option('display.max_columns', None)

In [40]:
# Raw string so every backslash in the Windows path is taken literally
data = pd.read_csv(r"C:\Users\pixie\GitHub\hr-practice\data\raw\HRDataset_v14.csv")

In [41]:
data.head()

Unnamed: 0,Employee_Name,EmpID,MarriedID,MaritalStatusID,GenderID,EmpStatusID,DeptID,PerfScoreID,FromDiversityJobFairID,Salary,Termd,PositionID,Position,State,Zip,DOB,Sex,MaritalDesc,CitizenDesc,HispanicLatino,RaceDesc,DateofHire,DateofTermination,TermReason,EmploymentStatus,Department,ManagerName,ManagerID,RecruitmentSource,PerformanceScore,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,LastPerformanceReview_Date,DaysLateLast30,Absences
0,"Adinolfi, Wilson K",10026,0,0,1,1,5,4,0,62506,0,19,Production Technician I,MA,1960,07/10/83,M,Single,US Citizen,No,White,7/5/2011,,N/A-StillEmployed,Active,Production,Michael Albert,22.0,LinkedIn,Exceeds,4.6,5,0,1/17/2019,0,1
1,"Ait Sidi, Karthikeyan",10084,1,1,1,5,3,3,0,104437,1,27,Sr. DBA,MA,2148,05/05/75,M,Married,US Citizen,No,White,3/30/2015,6/16/2016,career change,Voluntarily Terminated,IT/IS,Simon Roup,4.0,Indeed,Fully Meets,4.96,3,6,2/24/2016,0,17
2,"Akinkuolie, Sarah",10196,1,1,0,5,5,3,0,64955,1,20,Production Technician II,MA,1810,09/19/88,F,Married,US Citizen,No,White,7/5/2011,9/24/2012,hours,Voluntarily Terminated,Production,Kissy Sullivan,20.0,LinkedIn,Fully Meets,3.02,3,0,5/15/2012,0,3
3,"Alagbe,Trina",10088,1,1,0,1,5,3,0,64991,0,19,Production Technician I,MA,1886,09/27/88,F,Married,US Citizen,No,White,1/7/2008,,N/A-StillEmployed,Active,Production,Elijiah Gray,16.0,Indeed,Fully Meets,4.84,5,0,1/3/2019,0,15
4,"Anderson, Carol",10069,0,2,0,5,5,3,0,50825,1,19,Production Technician I,MA,2169,09/08/89,F,Divorced,US Citizen,No,White,7/11/2011,9/6/2016,return to school,Voluntarily Terminated,Production,Webster Butler,39.0,Google Search,Fully Meets,5.0,4,0,2/1/2016,0,2
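
The date columns in the output above are plain strings such as `7/5/2011`, and several of the questions ask whether employees "stay longer". A minimal sketch of one way to turn the dates into a tenure figure follows, using three rows copied from the output. The helper name `add_tenure_days`, the `TenureDays` column, and the `as_of` snapshot date are all hypothetical choices, not part of the dataset.

```python
import pandas as pd

# Three rows copied from the head() output (ID and date columns only).
sample = pd.DataFrame({
    "EmpID": [10026, 10084, 10196],
    "DateofHire": ["7/5/2011", "3/30/2015", "7/5/2011"],
    "DateofTermination": [None, "6/16/2016", "9/24/2012"],
})


def add_tenure_days(df, as_of):
    """Parse the hire/termination dates and add a TenureDays column.

    Employees with no termination date are measured up to `as_of`,
    a snapshot date picked by the analyst (an assumption, not dataset metadata).
    """
    out = df.copy()
    for col in ["DateofHire", "DateofTermination"]:
        out[col] = pd.to_datetime(out[col], format="%m/%d/%Y")
    end = out["DateofTermination"].fillna(pd.Timestamp(as_of))
    out["TenureDays"] = (end - out["DateofHire"]).dt.days
    return out


result = add_tenure_days(sample, as_of="2019-03-01")
print(result[["EmpID", "TenureDays"]])
```

The `DOB` column uses two-digit years (`07/10/83`), so it would need separate handling and is left out of this sketch.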


In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311 entries, 0 to 310
Data columns (total 36 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Employee_Name               311 non-null    object 
 1   EmpID                       311 non-null    int64  
 2   MarriedID                   311 non-null    int64  
 3   MaritalStatusID             311 non-null    int64  
 4   GenderID                    311 non-null    int64  
 5   EmpStatusID                 311 non-null    int64  
 6   DeptID                      311 non-null    int64  
 7   PerfScoreID                 311 non-null    int64  
 8   FromDiversityJobFairID      311 non-null    int64  
 9   Salary                      311 non-null    int64  
 10  Termd                       311 non-null    int64  
 11  PositionID                  311 non-null    int64  
 12  Position                    311 non-null    object 
 13  State                       311 non

In [36]:
data.head()

Unnamed: 0,EmpID,MarriedID,MaritalStatusID,GenderID,EmpStatusID,DeptID,PerfScoreID,FromDiversityJobFairID,Salary,Termd,...,Department,ManagerID,RecruitmentSource,PerformanceScore,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,LastPerformanceReview_Date,DaysLateLast30,Absences
0,10026,0,0,1,1,5,4,0,62506,0,...,Production,22.0,LinkedIn,Exceeds,4.6,5,0,1/17/2019,0,1
1,10084,1,1,1,5,3,3,0,104437,1,...,IT/IS,4.0,Indeed,Fully Meets,4.96,3,6,2/24/2016,0,17
2,10196,1,1,0,5,5,3,0,64955,1,...,Production,20.0,LinkedIn,Fully Meets,3.02,3,0,5/15/2012,0,3
3,10088,1,1,0,1,5,3,0,64991,0,...,Production,16.0,Indeed,Fully Meets,4.84,5,0,1/3/2019,0,15
4,10069,0,2,0,5,5,3,0,50825,1,...,Production,39.0,Google Search,Fully Meets,5.0,4,0,2/1/2016,0,2
