# Building a baseline

## Selecting explanatory variables

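### RelationshipSatisfaction (satisfaction with workplace relationships)

- Poor relationships at work are probably the hardest thing to put up with, so I hypothesized that low satisfaction leads people to leave.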

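### YearsInCurrentRole (years in the current role)

- Doing the same job for a long time may get boring, so I hypothesized that a long tenure in one role makes people consider leaving.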

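### NumCompaniesWorked (number of companies worked at)

- People who have changed jobs many times probably face a lower barrier to changing again, so they seem more likely to leave.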
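
Each hypothesis above can be checked quickly before any modeling by comparing the attrition rate across the values of a variable. The following is a minimal sketch on hypothetical toy data, not the competition data. It assumes `Attrition` is encoded as 0 (stayed) / 1 (left); if the real column holds strings, it would need to be mapped to 0/1 first.

```python
import pandas as pd

# Hypothetical toy data; in the notebook, df_train would be used instead.
# Assumption: Attrition is encoded as 0 (stayed) / 1 (left).
toy = pd.DataFrame(
    {
        "RelationshipSatisfaction": [1, 1, 2, 2, 3, 3, 4, 4],
        "Attrition": [1, 1, 1, 0, 0, 0, 1, 0],
    }
)

# The mean of a 0/1 target within a group is that group's attrition rate.
rate_by_satisfaction = toy.groupby("RelationshipSatisfaction")["Attrition"].mean()
print(rate_by_satisfaction)
```

If the rate barely changes across the groups, the variable is unlikely to help the model much on its own.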


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import lightgbm as lgb
from sklearn.model_selection import train_test_split, cross_val_score

In [2]:
df_train = pd.read_csv("../data/train.csv")

# print(df_train.head())
# print(df_train.describe())
# print(df_train.info())
# print(df_train.isnull().sum())

In [3]:
df_train_normal, df_train_verification = train_test_split(
    df_train, test_size=0.2, random_state=42, shuffle=True
)


X: pd.DataFrame = df_train_normal[
    ["RelationshipSatisfaction", "YearsInCurrentRole", "NumCompaniesWorked"]
]
y: pd.Series = df_train_normal["Attrition"]


print(X.shape, y.shape)

(960, 3) (960,)


In [4]:
model = lgb.LGBMClassifier()

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("Mean CV accuracy:", scores.mean())

[LightGBM] [Info] Number of positive: 133, number of negative: 635
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000382 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 31
[LightGBM] [Info] Number of data points in the train set: 768, number of used features: 3
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.173177 -> initscore=-1.563276
[LightGBM] [Info] Start training from score -1.563276
[LightGBM] [Info] Number of positive: 133, number of negative: 635
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000165 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 31
[LightGBM] [Info] Number of data points in the train set: 768, number of used features: 3
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.173177 -> initscore=-1.563276
[LightGBM] [Info] Start training from score -1.563276
[LightGBM] [Info] Number of po
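
The `pavg` and `initscore` values in the log can be reproduced by hand from the class counts. The sketch below assumes that, for the binary objective with default settings, LightGBM starts boosting from the log-odds of the positive-class fraction. The counts are copied from the log; the variable names are illustrative.

```python
import math

# Class counts copied from the LightGBM log above (one CV training fold).
n_pos, n_neg = 133, 635

# pavg is the fraction of positive samples in the training fold.
pavg = n_pos / (n_pos + n_neg)

# Assumed formula: the initial score is the log-odds of pavg.
init_score = math.log(pavg / (1 - pavg))

print(f"pavg={pavg:.6f} initscore={init_score:.6f}")
```

The result should agree with the logged values up to rounding. It also shows that only about 17% of the samples are positive, so the classes are noticeably imbalanced.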

#### Computing the importance of each explanatory variable

In [5]:
model.fit(X, y)

# Get the feature importances
importance = model.feature_importances_

# Get the feature names
feature_names = X.columns

# Print the importance of each feature
for feature, imp in zip(feature_names, importance):
    print(f"Feature: {feature}, Importance: {imp}")

[LightGBM] [Info] Number of positive: 166, number of negative: 794
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000228 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 32
[LightGBM] [Info] Number of data points in the train set: 960, number of used features: 3
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.172917 -> initscore=-1.565096
[LightGBM] [Info] Start training from score -1.565096
Feature: RelationshipSatisfaction, Importance: 884
Feature: YearsInCurrentRole, Importance: 1172
Feature: NumCompaniesWorked, Importance: 871
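
The raw importance values are easier to compare as shares of the total. With the default `importance_type="split"` of `LGBMClassifier`, each value counts how many times the feature was used in a split, so the shares describe usage frequency rather than predictive gain. A minimal sketch using the numbers printed above:

```python
# Importance values copied from the output above.
importances = {
    "RelationshipSatisfaction": 884,
    "YearsInCurrentRole": 1172,
    "NumCompaniesWorked": 871,
}

total = sum(importances.values())
shares = {name: value / total for name, value in importances.items()}

# Print the features from most to least used.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.1%}")
```

`YearsInCurrentRole` accounts for roughly 40% of all splits, while the other two features are close to 30% each.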


### Evaluating the baseline on the validation data

In [6]:
X_baseline: pd.DataFrame = df_train_verification[
    ["RelationshipSatisfaction", "YearsInCurrentRole", "NumCompaniesWorked"]
]
y_baseline: pd.Series = df_train_verification["Attrition"]

y_baseline_pred = model.predict(X_baseline)

# Print the accuracy
accuracy = (y_baseline == y_baseline_pred).sum() / len(y_baseline)
print(f"Accuracy: {accuracy}")

Accuracy: 0.775
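
To put this accuracy in context, it helps to compare it with a constant predictor that always outputs the majority class. The sketch below uses the class counts that LightGBM logged for the 960-row training split. The class ratio of the validation split is not shown in the notebook, so treating it as similar is an assumption.

```python
# Class counts copied from the LightGBM log for the 960-row training split.
n_pos, n_neg = 166, 794

# Accuracy of a constant predictor that always outputs the majority class.
majority_accuracy = max(n_pos, n_neg) / (n_pos + n_neg)

model_accuracy = 0.775  # validation accuracy reported above

print(f"majority-class accuracy (training split): {majority_accuracy:.3f}")
print(f"model beats constant predictor: {model_accuracy > majority_accuracy}")
```

If the validation split has a similar class ratio, always predicting "no attrition" would score about 0.83, which is higher than the model's 0.775. With classes this imbalanced, accuracy alone may not be the most informative metric.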


### Creating the submission

In [7]:
df_test = pd.read_csv("../data/test.csv")

X_test = df_test[
    ["RelationshipSatisfaction", "YearsInCurrentRole", "NumCompaniesWorked"]
]
test_ids = df_test["id"]

y_pred = model.predict(X_test)

df_out = pd.DataFrame({"id": test_ids, "Attrition": y_pred})
df_out.to_csv("../data/submission.csv", index=False)

### Trying feature selection

In [8]:
X = df_train_normal.drop(["Attrition"], axis=1)
y = df_train_normal["Attrition"]