# Classification - Employee Turnover

Working with employee turnover data
- Employee Turnover
- Source: [Kaggle - Employee Turnover](https://www.kaggle.com/datasets/davinwijaya/employee-turnover)
- Of the 16 columns in total, the target column is `event`

---

# Import Libraries & Load data

In [None]:
# Visual Python: Data Analysis > Import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
# Visual Python: Data Analysis > File
df = pd.read_csv('./data/Employee_Turnover.csv')
df

# EDA & Data Preprocessing

#### Q. Print the first 5 rows of the data.

In [None]:
# Visual Python: Data Analysis > Data Info
df.head()

#### Q. Check the data type and the number of non-null values for each column.

In [None]:
# Visual Python: Data Analysis > Data Info
df.info()

In [None]:
# Target column
col_target = 'event'
# Numeric columns
col_num = ['stag', 'age', 'extraversion', 'independ', 'selfcontrol', 'anxiety', 'novator']

#### Q. Check the number of missing values in each column.

In [None]:
# Visual Python: Data Analysis > Data Info
pd.DataFrame({'Null Count': df.isnull().sum(), 'Non-Null Count': df.notnull().sum()})
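
As a small illustration of the same check, the sketch below counts missing values on a hypothetical toy frame and also reports the share of missing values per column. The name `toy` and all values in it are made up for this example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Employee Turnover dataset
toy = pd.DataFrame({
    'age': [31.0, np.nan, 45.0, 28.0],
    'way': ['bus', 'car', None, 'bus'],
})

# Missing-value count and percentage per column
null_summary = pd.DataFrame({
    'Null Count': toy.isnull().sum(),
    'Null %': toy.isnull().mean() * 100,
})
print(null_summary)
```

A percentage is often easier to judge than a raw count when columns differ in how many rows they cover.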

#### Q. Print summary statistics for the numeric columns.
- Numeric columns: `stag`, `age`, `extraversion`, `independ`, `selfcontrol`, `anxiety`, `novator`

In [None]:
# Visual Python: Data Analysis > Data Info
df.describe()

#### Q. Check the correlation coefficients between the numeric columns.

In [None]:
# Visual Python: Data Analysis > Data Info
df.corr(numeric_only=True)
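
When the full correlation matrix is hard to read, one option is to look only at the column for the target. The sketch below does this on a hypothetical toy frame. The column names mirror the notebook, but the values are invented.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook, values are made up
toy = pd.DataFrame({
    'event': [0, 1, 0, 1, 1, 0],
    'stag': [10.0, 40.0, 12.0, 35.0, 50.0, 8.0],
    'age': [30.0, 25.0, 41.0, 29.0, 33.0, 38.0],
    'way': ['bus', 'car', 'bus', 'foot', 'car', 'bus'],
})

# numeric_only=True skips the non-numeric 'way' column
corr = toy.corr(numeric_only=True)

# Correlation of each numeric feature with the target, largest magnitude first
corr_with_target = corr['event'].drop('event').sort_values(key=abs, ascending=False)
print(corr_with_target)
```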

#### Q. Use histograms to check the distributions of the numeric columns.

In [None]:
# Visual Python: Data Analysis > Data Info
df.hist()
plt.show()

#### Q. Use box plots to check the distributions and outliers of the numeric columns.

In [None]:
# Visual Python: Data Analysis > Data Info
df.plot(kind='box')
plt.show()
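
The points drawn beyond the whiskers of a box plot can also be listed numerically. The sketch below applies the 1.5 × IQR rule, which is the default whisker length in matplotlib box plots, to a hypothetical series. The values are made up for illustration.

```python
import pandas as pd

# Hypothetical values, not taken from the dataset
values = pd.Series([1, 2, 3, 4, 100], name='stag')

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] is flagged as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # [100]
```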

#### Q. Draw a countplot showing the proportion of employee turnover (`event`).

In [None]:
# Visual Python: Visualization > Seaborn
sns.countplot(data=df, x='event')
plt.show()
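
A countplot shows raw counts, so the class proportions have to be read off the bar heights. If exact proportions are wanted, `value_counts(normalize=True)` returns them directly. The sketch below uses a hypothetical series of 0/1 target values.

```python
import pandas as pd

# Hypothetical 0/1 target values, not taken from the dataset
event = pd.Series([1, 0, 1, 1, 0, 1, 0, 1], name='event')

# Fraction of rows in each class
ratio = event.value_counts(normalize=True)
print(ratio)
```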

#### Q. Draw countplots of `gender` and `way`, grouped by turnover status (`event`).

In [None]:
# Visual Python: Visualization > Seaborn
sns.countplot(data=df, x='gender', hue='event', order=df['gender'].value_counts(ascending=False).index)
plt.show()

In [None]:
# Visual Python: Visualization > Seaborn
sns.countplot(data=df, x='way', hue='event')
plt.show()
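
Grouped counts can be hard to compare when the categories differ in size. One way to put them on a common scale is a row-normalized crosstab, sketched below on hypothetical data. The column names follow the notebook and the values are invented.

```python
import pandas as pd

# Hypothetical toy data; values are made up
toy = pd.DataFrame({
    'way': ['bus', 'bus', 'car', 'car', 'foot', 'foot', 'bus', 'car'],
    'event': [1, 0, 1, 1, 0, 1, 1, 0],
})

# normalize='index' makes every row sum to 1,
# so each cell is the share of that 'way' group with the given 'event' value
rate = pd.crosstab(toy['way'], toy['event'], normalize='index')
print(rate)
```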

#### Q. Label-encode the `gender`, `head_gender`, and `greywage` columns, then drop the original columns.

In [None]:
# Visual Python: Data Analysis > Frame
df['gender_label'] = pd.Categorical(df['gender']).codes
df['head_gender_label'] = pd.Categorical(df['head_gender']).codes
df['greywage_label'] = pd.Categorical(df['greywage']).codes
df.drop(['gender','head_gender','greywage'], axis=1, inplace=True)
df
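
To make the encoding step concrete, the sketch below shows how `pd.Categorical(...).codes` assigns integers on a hypothetical series. Categories are numbered in sorted order, and a missing value is encoded as -1.

```python
import pandas as pd

# Hypothetical values, including one missing entry
gender = pd.Series(['m', 'f', 'f', None, 'm'])

cat = pd.Categorical(gender)
print(list(cat.categories))  # ['f', 'm']
print(cat.codes.tolist())    # [1, 0, 0, -1, 1]
```

Because missing values become -1 rather than raising an error, it is worth checking for nulls before encoding.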

#### Q. One-hot encode the `industry`, `profession`, `traffic`, `coach`, and `way` columns.

In [None]:
# Visual Python: Data Analysis > Frame
df = pd.get_dummies(data=df, columns=['industry','profession','traffic','coach','way'])
df
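
The new column names produced by `pd.get_dummies` follow the pattern `<column>_<value>`. The sketch below shows this on a hypothetical two-column frame with made-up values.

```python
import pandas as pd

# Hypothetical toy data; values are made up
toy = pd.DataFrame({'age': [30, 41, 25], 'way': ['bus', 'car', 'foot']})

# Only the listed column is expanded; 'age' is passed through unchanged
encoded = pd.get_dummies(data=toy, columns=['way'])
print(encoded.columns.tolist())  # ['age', 'way_bus', 'way_car', 'way_foot']
```

Since the raw value is copied into the name as-is, any whitespace in a category value carries over, which appears to be why a name such as `industry_ HoReCa` contains a space.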

# Classifying with a Support Vector Machine model

#### Q. Split the dataset into train and test sets using `df`.

In [None]:
# Visual Python: Machine Learning > Data Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[[
        'stag', 'age', 'extraversion', 'independ', 'selfcontrol', 'anxiety', 'novator',
        'gender_label', 'head_gender_label', 'greywage_label',
        'industry_ HoReCa', 'industry_Agriculture', 'industry_Banks', 'industry_Building',
        'industry_Consult', 'industry_IT', 'industry_Mining', 'industry_Pharma',
        'industry_PowerGeneration', 'industry_RealEstate', 'industry_Retail', 'industry_State',
        'industry_Telecom', 'industry_etc', 'industry_manufacture', 'industry_transport',
        'profession_Accounting', 'profession_BusinessDevelopment', 'profession_Commercial',
        'profession_Consult', 'profession_Engineer', 'profession_Finane', 'profession_HR',
        'profession_IT', 'profession_Law', 'profession_Marketing', 'profession_PR',
        'profession_Sales', 'profession_Teaching', 'profession_etc', 'profession_manage',
        'traffic_KA', 'traffic_advert', 'traffic_empjs', 'traffic_friends',
        'traffic_rabrecNErab', 'traffic_recNErab', 'traffic_referal', 'traffic_youjs',
        'coach_my head', 'coach_no', 'coach_yes',
        'way_bus', 'way_car', 'way_foot',
    ]],
    df['event'],
)
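
The split above uses the defaults of `train_test_split`, so it changes on every run and does not explicitly preserve the class ratio. The sketch below shows the optional `test_size`, `stratify`, and `random_state` arguments on hypothetical synthetic data. The 0.2 test fraction and the seed 42 are arbitrary choices for illustration, not settings from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 100 rows, 30% positive class
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.2,     # assumed hold-out fraction; the default is 0.25
    stratify=y,        # keep the class ratio the same in both splits
    random_state=42,   # arbitrary seed so the split is reproducible
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```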

#### Q. Create a Support Vector Machine model, train it with fit, and store the predictions in `pred`.

In [None]:
# Visual Python: Machine Learning > Classifier
from sklearn.svm import SVC

model_svc = SVC()

In [None]:
# Visual Python: Machine Learning > Fit/Predict
model_svc.fit(X_train, y_train)

In [None]:
# Visual Python: Machine Learning > Fit/Predict
pred = model_svc.predict(X_test)
pred
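
An `SVC` with its default RBF kernel is sensitive to the scale of the input features, so standardizing them first may help. The sketch below is one possible variation, not the notebook's method: it wraps `StandardScaler` and `SVC` in a pipeline and runs on synthetic data from `make_classification`. The names `scaled_svc` and `pred_scaled` are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, not the Employee Turnover dataset
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X[:, 0] *= 100  # put one feature on a much larger scale than the others
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The scaler is fitted on the training data only, then applied before the SVC
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)
pred_scaled = scaled_svc.predict(X_te)
print(pred_scaled[:10])
```

Keeping the scaler inside the pipeline means the test rows never influence the scaling statistics.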

#### Q. Evaluate the predictions in `pred` and check the accuracy and f1-score.
- Visual Python: Machine Learning > Evaluation

In [None]:
# Visual Python: Machine Learning > Evaluation
from sklearn import metrics

In [None]:
# Visual Python: Machine Learning > Evaluation
from IPython.display import display, Markdown

In [None]:
# Visual Python: Machine Learning > Evaluation
# Confusion Matrix
display(Markdown('### Confusion Matrix'))
display(pd.crosstab(y_test, pred, margins=True))

In [None]:
# Visual Python: Machine Learning > Evaluation
# Classification report
print(metrics.classification_report(y_test, pred))
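
The classification report bundles several metrics together. If only accuracy and f1-score are needed as numbers, they can be computed directly, as in the sketch below. The labels `y_true` and `y_hat` are hypothetical and chosen so the result is easy to verify by hand.

```python
from sklearn import metrics

# Hypothetical labels: 2 true negatives, 1 false positive,
# 1 false negative, 2 true positives
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0]

acc = metrics.accuracy_score(y_true, y_hat)   # 4 of 6 correct
f1 = metrics.f1_score(y_true, y_hat)          # precision = recall = 2/3
cm = metrics.confusion_matrix(y_true, y_hat)  # rows: actual, columns: predicted

print(acc, f1)
print(cm)
```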

# Classifying with a RandomForestClassifier model

#### Q. Split the dataset into train and test sets using `df`.

In [None]:
# Visual Python: Machine Learning > Data Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[[
        'stag', 'age', 'extraversion', 'independ', 'selfcontrol', 'anxiety', 'novator',
        'gender_label', 'head_gender_label', 'greywage_label',
        'industry_ HoReCa', 'industry_Agriculture', 'industry_Banks', 'industry_Building',
        'industry_Consult', 'industry_IT', 'industry_Mining', 'industry_Pharma',
        'industry_PowerGeneration', 'industry_RealEstate', 'industry_Retail', 'industry_State',
        'industry_Telecom', 'industry_etc', 'industry_manufacture', 'industry_transport',
        'profession_Accounting', 'profession_BusinessDevelopment', 'profession_Commercial',
        'profession_Consult', 'profession_Engineer', 'profession_Finane', 'profession_HR',
        'profession_IT', 'profession_Law', 'profession_Marketing', 'profession_PR',
        'profession_Sales', 'profession_Teaching', 'profession_etc', 'profession_manage',
        'traffic_KA', 'traffic_advert', 'traffic_empjs', 'traffic_friends',
        'traffic_rabrecNErab', 'traffic_recNErab', 'traffic_referal', 'traffic_youjs',
        'coach_my head', 'coach_no', 'coach_yes',
        'way_bus', 'way_car', 'way_foot',
    ]],
    df['event'],
)

#### Q. Create a RandomForestClassifier model, train it with fit, and store the predictions in `pred2`.

In [None]:
# Visual Python: Machine Learning > Classifier
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier()

In [None]:
# Visual Python: Machine Learning > Fit/Predict
model_rf.fit(X_train, y_train)

In [None]:
# Visual Python: Machine Learning > Fit/Predict
pred2 = model_rf.predict(X_test)
pred2

#### Q. Evaluate the predictions in `pred2` and check the accuracy and f1-score.
- Visual Python: Machine Learning > Evaluation

In [None]:
# Visual Python: Machine Learning > Evaluation
from IPython.display import display, Markdown

In [None]:
# Visual Python: Machine Learning > Evaluation
# Confusion Matrix
display(Markdown('### Confusion Matrix'))
display(pd.crosstab(y_test, pred2, margins=True))

In [None]:
# Visual Python: Machine Learning > Evaluation
# Classification report
print(metrics.classification_report(y_test, pred2))
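
The notebook calls `train_test_split` a second time without a fixed `random_state`, so the SVC and the random forest are scored on different random test sets. For a like-for-like comparison, both models can be trained and evaluated on one shared split. The sketch below shows this on synthetic data. The seed values and the `scores` dictionary are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data, not the Employee Turnover dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# One split, reused for both models
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {'SVC': SVC(), 'RandomForest': RandomForestClassifier(random_state=0)}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    scores[name] = {
        'accuracy': accuracy_score(y_te, y_hat),
        'f1': f1_score(y_te, y_hat),
    }
print(scores)
```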

#### Q. Plot the feature importances as a chart.

In [None]:
# Visual Python: Machine Learning > Model Info
def vp_create_feature_importances(model, X_train=None, sort=False):
    # Use the DataFrame column names when available, otherwise X0, X1, ...
    if isinstance(X_train, pd.DataFrame):
        feature_names = X_train.columns
    else:
        feature_names = ['X{}'.format(i) for i in range(len(model.feature_importances_))]

    df_i = pd.DataFrame(model.feature_importances_, index=feature_names, columns=['Feature_importance'])
    df_i['Percentage'] = 100 * df_i['Feature_importance']
    if sort:
        df_i.sort_values(by='Feature_importance', ascending=False, inplace=True)
    df_i = df_i.round(2)

    return df_i


def vp_plot_feature_importances(model, X_train=None, sort=False, top_count=0):
    df_i = vp_create_feature_importances(model, X_train, sort)

    if sort:
        # Sort ascending so the largest bar is drawn at the top of the barh chart
        if top_count > 0:
            df_i['Percentage'].sort_values().tail(top_count).plot(kind='barh')
        else:
            df_i['Percentage'].sort_values().plot(kind='barh')
    else:
        df_i['Percentage'].plot(kind='barh')
    plt.xlabel('Feature importance Percentage')
    plt.ylabel('Features')

    plt.show()

In [None]:
# Visual Python: Machine Learning > Model Info
vp_plot_feature_importances(model_rf, X_train, sort=True, top_count=10)
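
The same ranking can be obtained without the helper functions by wrapping `feature_importances_` in a pandas Series. The sketch below is a compact alternative on synthetic data. The feature names `f0` to `f4` and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=[f'f{i}' for i in range(5)])

rf = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across all features
importances = pd.Series(rf.feature_importances_, index=X.columns)
top3 = importances.nlargest(3)
print(top3)
```

Calling `top3.sort_values().plot(kind='barh')` would then draw a chart similar to the one above.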

---

In [None]:
# End of file