# Employee Attrition Prediction Training Competition
[Project source: DataCastle (数据城堡)](http://www.pkbigdata.com/common/cmpt/%E5%91%98%E5%B7%A5%E7%A6%BB%E8%81%8C%E9%A2%84%E6%B5%8B%E8%AE%AD%E7%BB%83%E8%B5%9B_%E7%AB%9E%E8%B5%9B%E4%BF%A1%E6%81%AF.html)

Author: 陈坚

---

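# Contents
1. Problem statement
2. Understanding the data
 * Data collection
 * Importing the data
 * Inspecting the dataset
3. Data cleaning
 * Feature extraction
 * Feature selection
4. Building the model
5. Model evaluation
6. Implementation
 * Submitting results to DataCastle
 * Conclusion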

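# 1. Problem Statement

The pace of life keeps accelerating, and for most people a job is what provides a stable living. Even so, many people keep resigning or changing jobs for all sorts of reasons. Factors that may influence attrition include salary, business travel, satisfaction with the work environment, job involvement, overtime, promotion, the percentage of salary increase, and family circumstances.

**The question studied here is: which factors are most likely to influence employee attrition?**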

# 2. Understanding the Data

## 2.1 Data Collection

[The data comes from DataCastle (数据城堡)](http://www.pkbigdata.com/common/cmpt/%E5%91%98%E5%B7%A5%E7%A6%BB%E8%81%8C%E9%A2%84%E6%B5%8B%E8%AE%AD%E7%BB%83%E8%B5%9B_%E7%AB%9E%E8%B5%9B%E4%BF%A1%E6%81%AF.html)

## 2.2 Importing the Data

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load the data
# Training set (includes the Attrition label)
train=pd.read_csv('pfm_train.csv')
# Test set (no label column)
test=pd.read_csv('pfm_test.csv')

print('Training set:',train.shape)
print('Test set:',test.shape)

Training set: (1100, 31)
Test set: (350, 30)


In [3]:
# Merge the two sets so both can be cleaned in one pass
full=pd.concat([train,test],ignore_index=True)
print('Merged dataset:',full.shape)

Merged dataset: (1450, 31)
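
Because the test file has no `Attrition` column, the merged frame holds `NaN` labels for the test rows. The following is a minimal sketch, on toy frames with made-up values, of what the concatenation does and how the two parts can be recovered from the label column alone. The names `toy_train`, `toy_test` and `toy_full` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frames with hypothetical values; the real data comes from
# pfm_train.csv (labelled) and pfm_test.csv (no Attrition column).
toy_train = pd.DataFrame({"Age": [37, 54], "Attrition": [0, 1]})
toy_test = pd.DataFrame({"Age": [40]})

# Same call as in the notebook: rows are stacked and re-indexed from 0.
toy_full = pd.concat([toy_train, toy_test], ignore_index=True)

# Test rows get NaN in the label column, which also makes the column float.
is_labelled = toy_full["Attrition"].notna()
recovered_train = toy_full[is_labelled]
recovered_test = toy_full[~is_labelled].drop(columns="Attrition")

print(toy_full)
print(recovered_train.shape, recovered_test.shape)
```

This is why `Attrition` later appears as `float64` with fewer non-null entries than the other columns.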


## 2.3 Inspecting the Dataset

In [4]:
# Show the first 5 rows
full.head()

| | Age | Attrition | BusinessTravel | Department | DistanceFromHome | Education | EducationField | EmployeeNumber | EnvironmentSatisfaction | Gender | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 0.0 | Travel_Rarely | Research & Development | 1 | 4 | Life Sciences | 77 | 1 | Male | ... | 3 | 80 | 1 | 7 | 2 | 4 | 7 | 5 | 0 | 7 |
| 1 | 54 | 0.0 | Travel_Frequently | Research & Development | 1 | 4 | Life Sciences | 1245 | 4 | Female | ... | 1 | 80 | 1 | 33 | 2 | 1 | 5 | 4 | 1 | 4 |
| 2 | 34 | 1.0 | Travel_Frequently | Research & Development | 7 | 3 | Life Sciences | 147 | 1 | Male | ... | 4 | 80 | 0 | 9 | 3 | 3 | 9 | 7 | 0 | 6 |
| 3 | 39 | 0.0 | Travel_Rarely | Research & Development | 1 | 1 | Life Sciences | 1026 | 4 | Female | ... | 3 | 80 | 1 | 21 | 3 | 3 | 21 | 6 | 11 | 8 |
| 4 | 28 | 1.0 | Travel_Frequently | Research & Development | 1 | 3 | Medical | 1111 | 1 | Male | ... | 1 | 80 | 2 | 1 | 2 | 3 | 1 | 0 | 0 | 0 |


In [5]:
full.describe()

| | Age | Attrition | DistanceFromHome | Education | EmployeeNumber | EnvironmentSatisfaction | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1450.0 | 1100.0 | 1450.0 | 1450.0 | 1450.0 | 1450.0 | 1450.0 | 1450.0 | 1450.0 | 1450.0 | ... | 1450.0 | 1450.0 | 1450.0 | 1450.0 | 1450.0 | 1450.0 | 1450.0 | 1450.0 | 1450.0 | 1450.0 |
| mean | 36.871724 | 0.161818 | 9.177241 | 2.909655 | 1026.981379 | 2.722759 | 2.731724 | 2.057931 | 2.731034 | 6482.624138 | ... | 2.708276 | 80.0 | 0.795172 | 11.217241 | 2.801379 | 2.761379 | 6.956552 | 4.22 | 2.16 | 4.097931 |
| std | 9.119033 | 0.368451 | 8.085783 | 1.023925 | 602.029616 | 1.090314 | 0.711068 | 1.103084 | 1.103074 | 4694.115546 | ... | 1.08239 | 0.0 | 0.853752 | 7.738772 | 1.292009 | 0.706588 | 6.053036 | 3.617954 | 3.18867 | 3.546603 |
| min | 18.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1009.0 | ... | 1.0 | 80.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 0.0 | 2.0 | 2.0 | 494.25 | 2.0 | 2.0 | 1.0 | 2.0 | 2909.5 | ... | 2.0 | 80.0 | 0.0 | 6.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 |
| 50% | 36.0 | 0.0 | 7.0 | 3.0 | 1023.0 | 3.0 | 3.0 | 2.0 | 3.0 | 4903.5 | ... | 3.0 | 80.0 | 1.0 | 10.0 | 3.0 | 3.0 | 5.0 | 3.0 | 1.0 | 3.0 |
| 75% | 43.0 | 0.0 | 14.0 | 4.0 | 1559.5 | 4.0 | 3.0 | 3.0 | 4.0 | 8339.75 | ... | 4.0 | 80.0 | 1.0 | 15.0 | 3.0 | 3.0 | 9.0 | 7.0 | 2.75 | 7.0 |
| max | 60.0 | 1.0 | 29.0 | 5.0 | 2068.0 | 4.0 | 4.0 | 5.0 | 4.0 | 19999.0 | ... | 4.0 | 80.0 | 3.0 | 40.0 | 6.0 | 4.0 | 37.0 | 18.0 | 15.0 | 17.0 |


In [6]:
# Inspect column names, dtypes and non-null counts
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1450 entries, 0 to 1449
Data columns (total 31 columns):
Age                         1450 non-null int64
Attrition                   1100 non-null float64
BusinessTravel              1450 non-null object
Department                  1450 non-null object
DistanceFromHome            1450 non-null int64
Education                   1450 non-null int64
EducationField              1450 non-null object
EmployeeNumber              1450 non-null int64
EnvironmentSatisfaction     1450 non-null int64
Gender                      1450 non-null object
JobInvolvement              1450 non-null int64
JobLevel                    1450 non-null int64
JobRole                     1450 non-null object
JobSatisfaction             1450 non-null int64
MaritalStatus               1450 non-null object
MonthlyIncome               1450 non-null int64
NumCompaniesWorked          1450 non-null int64
Over18                      1450 non-null object
OverTime            

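None of the feature columns has missing values, which is good news. Attrition shows only 1100 non-null entries because the 350 test rows carry no label. It is the target that the model will predict, so this column needs no cleaning.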
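
A quick way to confirm this without reading the `info()` listing line by line is to count the gaps per column. The sketch below uses a toy frame with made-up values in place of `full`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: the last row mimics an unlabelled test row.
toy = pd.DataFrame({
    "Age": [37, 54, 40],
    "OverTime": ["No", "Yes", "No"],
    "Attrition": [0.0, 1.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Feature columns (everything except the label) that contain gaps.
feature_gaps = missing_per_column.drop("Attrition")
features_with_gaps = feature_gaps[feature_gaps > 0].index.tolist()
print(features_with_gaps)
```

An empty list here means only the label column has gaps, which is the situation described for the real data.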

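### Field Descriptions

(1) Age: the employee's age

(2) Attrition: the prediction target; 1 means the employee has left, 0 means the employee has not left

(3) BusinessTravel: how often the employee travels for business; Non-Travel means never, Travel_Rarely means rarely, Travel_Frequently means frequently

(4) Department: the employee's department; Sales, Research & Development, or Human Resources

(5) DistanceFromHome: distance between the company and the employee's home, from 1 to 29; 1 is the nearest and 29 the farthest

(6) Education: the employee's education level, from 1 to 5; 5 is the highest

(7) EducationField: the employee's field of study

(8) EmployeeNumber: the employee's ID number

(9) EnvironmentSatisfaction: satisfaction with the work environment, from 1 to 4; 1 is the lowest and 4 the highest

(10) Gender: the employee's gender; Male or Female

(11) JobInvolvement: job involvement, from 1 to 4; 1 is the lowest and 4 the highest

(12) JobLevel: job level, from 1 to 5; 1 is the lowest and 5 the highest

(13) JobRole: the employee's job role

(14) JobSatisfaction: job satisfaction, from 1 to 4; 1 is the lowest and 4 the highest

(15) MaritalStatus: marital status; Single, Married, or Divorced

(16) MonthlyIncome: the employee's monthly income

(17) NumCompaniesWorked: number of companies the employee has worked for

(18) Over18: whether the employee is over 18

(19) OverTime: whether the employee works overtime; Yes or No

(20) PercentSalaryHike: percentage of salary increase

(21) PerformanceRating: performance rating

(22) RelationshipSatisfaction: relationship satisfaction, from 1 to 4; 1 is the lowest and 4 the highest

(23) StandardHours: standard working hours

(24) StockOptionLevel: stock option level

(25) TotalWorkingYears: total years of work experience

(26) TrainingTimesLastYear: amount of training in the previous year, from 0 to 6; 0 means no training and 6 the most

(27) WorkLifeBalance: balance between work and life, from 1 to 4; 1 is the lowest and 4 the highest

(28) YearsAtCompany: years at the current company

(29) YearsInCurrentRole: years in the current role

(30) YearsSinceLastPromotion: years since the last promotion

(31) YearsWithCurrManager: years working with the current manager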
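
Some of these fields carry no information for a model. In the `describe()` output `StandardHours` has a standard deviation of 0, and `Over18` is mapped from the single value `'Y'`. A minimal sketch of detecting such constant columns programmatically, on a toy frame with made-up rows (the names `toy` and `constant_columns` are illustrative):

```python
import pandas as pd

# Toy frame with hypothetical rows; Over18 and StandardHours never vary.
toy = pd.DataFrame({
    "Age": [37, 54, 34],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
    "MonthlyIncome": [5993, 10502, 6074],
})

# A column with a single distinct value cannot help separate the classes.
unique_counts = toy.nunique()
constant_columns = unique_counts[unique_counts == 1].index.tolist()
print(constant_columns)
```

Running the same check on the full frame would be one way to justify which columns are dropped during feature selection.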

# 3. Data Cleaning

## 3.1 Feature Extraction
### Categorical data: replace categories with numbers (one-hot encoding)

In [7]:
# Gender: Male is encoded as 1, Female as 0
full['Gender']=full.Gender.map({'Male':1,'Female':0})
full.Gender.head()

0    1
1    0
2    1
3    0
4    1
Name: Gender, dtype: int64

In [8]:
# OverTime: 1 means the employee works overtime, 0 means not
full['OverTime']=full.OverTime.map({'Yes':1,'No':0})
full.OverTime.head()

0    0
1    0
2    1
3    0
4    0
Name: OverTime, dtype: int64

In [9]:
# Over18: the value 'Y' is encoded as 1
full['Over18']=full.Over18.map({'Y':1})
full.Over18.head()

0    1
1    1
2    1
3    1
4    1
Name: Over18, dtype: int64

In [10]:
# Business travel frequency: one-hot encode with get_dummies to create dummy variables
BusinessTravelDf=pd.get_dummies(full.BusinessTravel,prefix='BT')
BusinessTravelDf.head()

| | BT_Non-Travel | BT_Travel_Frequently | BT_Travel_Rarely |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 |


In [11]:
# Department
DepartmentDf=pd.get_dummies(full.Department,prefix='Depart')
DepartmentDf.head()

| | Depart_Human Resources | Depart_Research & Development | Depart_Sales |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 |


In [12]:
# Field of study
EducationFieldDf=pd.get_dummies(full.EducationField,prefix='Edu')
EducationFieldDf.head()

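| | Edu_Human Resources | Edu_Life Sciences | Edu_Marketing | Edu_Medical | Edu_Other | Edu_Technical Degree |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 |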


In [13]:
# Job role
JobRoleDf=pd.get_dummies(full.JobRole,prefix='JR')
JobRoleDf.head()

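| | JR_Healthcare Representative | JR_Human Resources | JR_Laboratory Technician | JR_Manager | JR_Manufacturing Director | JR_Research Director | JR_Research Scientist | JR_Sales Executive | JR_Sales Representative |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |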


In [14]:
# Marital status
MaritalStatusDf=pd.get_dummies(full.MaritalStatus,prefix='MS')
MaritalStatusDf.head()

| | MS_Divorced | MS_Married | MS_Single |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 |
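
The five `get_dummies` calls can also be collapsed into one. The sketch below is an alternative formulation rather than the notebook's own code: it passes the whole frame with `columns=` and a `prefix` dictionary, which encodes the listed columns and drops the originals in a single step. The toy values are made up.

```python
import pandas as pd

# Toy frame with hypothetical rows and two of the categorical columns.
toy = pd.DataFrame({
    "Age": [37, 54, 34],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel"],
    "MaritalStatus": ["Divorced", "Single", "Married"],
})

# Encode the listed columns in place; untouched columns are kept as they are.
encoded = pd.get_dummies(
    toy,
    columns=["BusinessTravel", "MaritalStatus"],
    prefix={"BusinessTravel": "BT", "MaritalStatus": "MS"},
    dtype=int,  # 0/1 integers rather than booleans
)
print(encoded.columns.tolist())
```

With this form there is no separate concatenation or drop step to keep in sync with the list of encoded columns.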


In [15]:
# Concatenate the dataset with the generated dummy variables
fullDf=pd.concat([full,BusinessTravelDf,DepartmentDf,EducationFieldDf,JobRoleDf,MaritalStatusDf],axis=1)

# Then drop the original categorical columns
fullDf.drop(['BusinessTravel','Department','EducationField','JobRole','MaritalStatus'],axis=1,inplace=True)

In [16]:
print('New dataset size:',fullDf.shape)
print('*'*50)
fullDf.info()

New dataset size: (1450, 50)
**************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1450 entries, 0 to 1449
Data columns (total 50 columns):
Age                              1450 non-null int64
Attrition                        1100 non-null float64
DistanceFromHome                 1450 non-null int64
Education                        1450 non-null int64
EmployeeNumber                   1450 non-null int64
EnvironmentSatisfaction          1450 non-null int64
Gender                           1450 non-null int64
JobInvolvement                   1450 non-null int64
JobLevel                         1450 non-null int64
JobSatisfaction                  1450 non-null int64
MonthlyIncome                    1450 non-null int64
NumCompaniesWorked               1450 non-null int64
Over18                           1450 non-null int64
OverTime                         1450 non-null int64
PercentSalaryHike                1450 non-null int64
PerformanceRating         

## 3.2 Feature Selection

Correlation coefficients: compute the correlation between each feature and the label.

In [17]:
fullDf.corr().Attrition.sort_values(ascending=False)

Attrition                        1.000000
OverTime                         0.267080
MS_Single                        0.186083
JR_Sales Representative          0.153417
DistanceFromHome                 0.088563
BT_Travel_Frequently             0.081314
Depart_Sales                     0.072324
Edu_Technical Degree             0.063420
JR_Laboratory Technician         0.062296
Edu_Human Resources              0.055427
JR_Human Resources               0.052961
Edu_Marketing                    0.049815
PerformanceRating                0.046762
JR_Research Scientist            0.032271
Depart_Human Resources           0.028385
PercentSalaryHike                0.026604
NumCompaniesWorked               0.025889
Gender                           0.016750
JR_Sales Executive               0.012014
BT_Travel_Rarely                -0.023803
Edu_Life Sciences               -0.023806
Edu_Other                       -0.033936
TrainingTimesLastYear           -0.043395
EmployeeNumber                  -0

OverTime has the strongest positive correlation with Attrition (about 0.27), while TotalWorkingYears shows a comparatively strong negative correlation with Attrition.
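
Sorting by the signed coefficient puts strong negative correlations at the bottom of the list, where they are easy to miss. A minimal sketch of ranking by absolute value instead, on a toy frame with made-up values (the real notebook would use `fullDf`):

```python
import pandas as pd

# Toy frame with hypothetical values for three features and the label.
toy = pd.DataFrame({
    "Attrition":         [1, 0, 1, 0, 0, 1, 0, 0],
    "OverTime":          [1, 0, 1, 0, 0, 1, 1, 0],
    "TotalWorkingYears": [1, 10, 3, 20, 15, 2, 8, 12],
    "Gender":            [1, 0, 0, 1, 0, 1, 1, 0],
})

# Correlation of every feature with the label, label itself removed.
corr_with_label = toy.corr()["Attrition"].drop("Attrition")

# Order by strength of association while keeping the sign visible.
order = corr_with_label.abs().sort_values(ascending=False).index
ranked = corr_with_label.reindex(order)
print(ranked)
```

Note that a Pearson coefficient only captures linear association, so a low value does not prove a feature is useless.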

In [18]:
# Drop uninformative columns: Over18 and StandardHours are constant, EmployeeNumber is only an ID
fullDf.drop(['Over18','StandardHours','EmployeeNumber'],axis=1,inplace=True)

In [19]:
# Features of the source (labelled) set; drop returns a new frame, so the slice of fullDf is not modified in place
source_X=fullDf[:1100].drop('Attrition',axis=1)
source_X.head()

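| | Age | DistanceFromHome | Education | EnvironmentSatisfaction | Gender | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | NumCompaniesWorked | ... | JR_Laboratory Technician | JR_Manager | JR_Manufacturing Director | JR_Research Director | JR_Research Scientist | JR_Sales Executive | JR_Sales Representative | MS_Divorced | MS_Married | MS_Single |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 1 | 4 | 1 | 1 | 2 | 2 | 3 | 5993 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 54 | 1 | 4 | 4 | 0 | 3 | 3 | 3 | 10502 | 7 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 34 | 7 | 3 | 1 | 1 | 1 | 2 | 3 | 6074 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 39 | 1 | 1 | 4 | 0 | 2 | 4 | 4 | 12742 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 28 | 1 | 3 | 1 | 1 | 2 | 1 | 2 | 2596 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |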


In [20]:
# Labels of the source set (.loc slicing is inclusive, so :1099 selects the first 1100 rows)
source_y=fullDf.loc[:1099,'Attrition']
source_y.head()

0    0.0
1    0.0
2    1.0
3    0.0
4    1.0
Name: Attrition, dtype: float64

In [21]:
# Features of the prediction set; drop returns a new frame, so the slice of fullDf is not modified in place
pred_X=fullDf[1100:].drop('Attrition',axis=1)
pred_X.head()

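| | Age | DistanceFromHome | Education | EnvironmentSatisfaction | Gender | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | NumCompaniesWorked | ... | JR_Laboratory Technician | JR_Manager | JR_Manufacturing Director | JR_Research Director | JR_Research Scientist | JR_Sales Executive | JR_Sales Representative | MS_Divorced | MS_Married | MS_Single |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1100 | 40 | 9 | 4 | 3 | 1 | 3 | 2 | 3 | 3975 | 3 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1101 | 53 | 7 | 2 | 4 | 0 | 3 | 5 | 3 | 18606 | 3 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1102 | 42 | 2 | 4 | 1 | 1 | 2 | 2 | 4 | 6781 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1103 | 34 | 11 | 3 | 3 | 1 | 2 | 2 | 2 | 4490 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1104 | 32 | 1 | 1 | 4 | 1 | 3 | 1 | 1 | 2956 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |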


In [22]:
print('Source set features:',source_X.shape)
print('Source set labels:',source_y.shape)
print('Prediction set features:',pred_X.shape)

Source set features: (1100, 46)
Source set labels: (1100,)
Prediction set features: (350, 46)


# 4. Building the Model

## 4.1 Creating the Training and Test Sets

In [23]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Feature scaling
scaler=StandardScaler()
source_X_scaler=scaler.fit_transform(source_X)
pred_X_scaler=scaler.transform(pred_X)

# Split the source set 4:1 into a training set and a test set
train_X,test_X,train_y,test_y=train_test_split(source_X_scaler,source_y,test_size=0.2,random_state=123)

In [24]:
print('Training features:',train_X.shape)
print('Test features:',test_X.shape)
print('Training labels:',train_y.shape)
print('Test labels:',test_y.shape)

Training features: (880, 46)
Test features: (220, 46)
Training labels: (880,)
Test labels: (220,)
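
In the notebook the scaler is fitted on all 1100 labelled rows before the split, so the held-out rows contribute to the scaling statistics. The effect is usually small, but a common way to avoid it is to split first and wrap the scaler and the model in a pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in for `source_X` and `source_y`, so its score says nothing about the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for source_X / source_y (not the competition data).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=123
)

# Split first, so held-out rows never influence the scaler statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

# fit() learns the scaling from X_tr only; score() reuses it on X_te.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the scaler is refitted inside every cross-validation fold.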


## 4.2 Choosing a Machine Learning Algorithm

This project uses logistic regression.

In [25]:
# Logistic regression model
from sklearn.linear_model import LogisticRegression

# Grid search
from sklearn.model_selection import GridSearchCV

# Use GridSearchCV to pick the best regularization parameter C with 5-fold cross-validation
lg=LogisticRegression()
clf=GridSearchCV(lg,param_grid=[{'C':np.arange(0.001,0.05,0.001)}],cv=5)

## 4.3 Training the Model

In [26]:
# Train the model; the best parameter found is C=0.032
clf.fit(train_X,train_y)
best_model=clf.best_estimator_
clf.best_params_

{'C': 0.032}
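
Since the question posed in section 1 is which factors influence attrition, one way to read a fitted logistic regression is through its coefficients: on standardized features, a larger absolute coefficient suggests a stronger association with the outcome. The sketch below is an illustration on synthetic data whose labels are generated from an assumed rule. The names `toy_X`, `toy_y` and `toy_model` are hypothetical, and the printed numbers are not results from the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic features with hypothetical distributions.
toy_X = pd.DataFrame({
    "OverTime": rng.integers(0, 2, n),
    "TotalWorkingYears": rng.integers(0, 40, n),
    "Gender": rng.integers(0, 2, n),
})

# Assumed rule: overtime raises and seniority lowers the odds of leaving.
logit = 1.5 * toy_X["OverTime"] - 0.1 * toy_X["TotalWorkingYears"]
toy_y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Same regularization strength as the grid search result, C=0.032.
toy_model = LogisticRegression(C=0.032)
toy_model.fit(StandardScaler().fit_transform(toy_X), toy_y)

# One coefficient per feature, most negative first.
coefficients = pd.Series(toy_model.coef_[0], index=toy_X.columns).sort_values()
print(coefficients)
```

In the notebook the analogous values would come from `best_model.coef_[0]` paired with `source_X.columns`.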

# 5. Model Evaluation

In [27]:
# For a classification problem, score returns the model's accuracy
best_model.score(test_X,test_y)

0.8772727272727273
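
Accuracy should be read against the class balance. The `describe()` output gives a mean of about 0.16 for `Attrition`, so a model that always predicts 0 would already be right roughly 84% of the time on similarly distributed data. The sketch below, on synthetic data with a comparable imbalance rather than the competition data, compares a majority-class baseline with logistic regression and prints metrics that are less sensitive to imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 16% positives (not the competition data).
X, y = make_classification(
    n_samples=1100, n_features=10, weights=[0.84, 0.16], random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_recall = recall_score(y_te, baseline.predict(X_te))
matrix = confusion_matrix(y_te, model.predict(X_te))

print("baseline accuracy:", baseline.score(X_te, y_te))
print("baseline recall:  ", baseline_recall)
print("model accuracy:   ", model.score(X_te, y_te))
print("model recall:     ", recall_score(y_te, model.predict(X_te)))
print("model ROC AUC:    ", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
print(matrix)
```

The baseline reaches high accuracy while finding none of the leavers, which is why recall or ROC AUC is worth reporting alongside accuracy.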

# 6. Implementation

## 6.1 Uploading the Predictions to DataCastle

In [28]:
# Predict labels for the prediction set
result=best_model.predict(pred_X_scaler)
result

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1.,
       0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0.,
       0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 1.

In [29]:
# Save in the format required by DataCastle, then upload
result=result.astype(int)
resultDf=pd.DataFrame({'result':result})

resultDf.to_csv('result.csv',index=False)
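
Before uploading, it can be worth reading the file back and checking its shape and contents. The sketch below writes a toy submission to a temporary directory; the predictions are made up, and the exact format DataCastle requires is assumed here to be a single `result` column, as in the cell above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical predictions standing in for the output of best_model.predict.
toy_result = np.array([0.0, 1.0, 0.0, 0.0])
toy_submission = pd.DataFrame({"result": toy_result.astype(int)})

# Write to a temporary location and read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "result.csv")
    toy_submission.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# One column, one row per sample, and only the labels 0 and 1.
print(reloaded.shape)
print(sorted(reloaded["result"].unique()))
```

For the real file the row count should equal the 350 rows of the test set.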

The figure below shows the leaderboard rank after submitting this notebook's code.
![Leaderboard rank](rank.png)

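## 6.2 Conclusion

Employee attrition is associated with many features. For example, employees who often work overtime leave at a higher rate, and personal circumstances (such as being single, or age) also affect attrition. This project used only logistic regression; trying other suitable machine learning algorithms later may well improve prediction accuracy. I was rather surprised to see such a high ranking at first, but this competition is aimed at beginners and the project itself is fairly simple, so the ranking means little and there is still a lot to learn.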
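
As a starting point for that follow-up, several algorithms can be compared under the same cross-validation protocol. The sketch below is an assumption about how such a comparison might look, not something done in this project: it uses synthetic data, and the choice of a random forest as the second candidate is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with roughly 16% positives (not the competition data).
X, y = make_classification(
    n_samples=600, n_features=12, weights=[0.84, 0.16], random_state=123
)

# Candidate models; the random forest is an arbitrary second choice.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=123),
}

# Mean 5-fold ROC AUC for each candidate.
cv_scores = {
    name: cross_val_score(estimator, X, y, cv=5, scoring="roc_auc").mean()
    for name, estimator in candidates.items()
}
for name, score in cv_scores.items():
    print(name, round(score, 3))
```

Which model wins on the real data cannot be inferred from this toy run.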