# Smart Factory Product Quality Classification AI Online Hackathon

- Team: 김주원, 김민근
- Model: XGBoost

# 1. Data Analysis

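- Training an ML model properly starts with careful data analysis.
- We load the training (Train) and test (Test) data for modeling, then check the number and names of the columns and the target variable. To understand how the variables relate to one another, we applied several analysis techniques, including histograms, heatmaps, and clustering.
- In summary, the correlations between variables, which measure linearity, turned out to be very low, so multiple linear regression and other regression analyses are not well suited to building a predictive model for this data.
- Clustering shows that the distribution of values separates meaningfully by Store type.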
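
As a rough illustration of the correlation check described above, the sketch below ranks numeric features by their absolute Pearson correlation with a target column. The toy DataFrame, its column names (`x_lin`, `x_noise`, `target`), and the helper `rank_by_abs_corr` are hypothetical stand-ins, not the competition data or the analysis code actually used.

```python
import numpy as np
import pandas as pd


def rank_by_abs_corr(df: pd.DataFrame, target: str) -> pd.Series:
    """Return |Pearson r| between each numeric feature and the target, sorted descending."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.drop(columns=[target]).corrwith(numeric[target])
    return corr.abs().sort_values(ascending=False)


# Toy data: one feature linearly tied to the target, one that is pure noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x_lin": np.arange(100, dtype=float)})
toy["x_noise"] = rng.normal(size=100)
toy["target"] = 2.0 * toy["x_lin"] + 1.0

ranking = rank_by_abs_corr(toy, "target")
print(ranking)
```

Values near 0 across most features would match the low linearity reported above and would support choosing a tree-based model over a linear one.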

## 1.1 Importing Packages

- Import the packages needed for data analysis and ML modeling. The libraries used are pandas and numpy for data handling, seaborn and matplotlib.pyplot for visualization, train_test_split from sklearn.model_selection, XGBClassifier from xgboost for training the model, and f1_score from sklearn.metrics for evaluation.

In [48]:
# Data analysis
import pandas as pd
import numpy as np
# Data analysis (visualization)
import matplotlib.pyplot as plt
import seaborn as sns
# ML modeling
from sklearn.model_selection import train_test_split
import skimage
import shap
from xgboost import XGBClassifier

# Evaluation metric (macro F1 score)
from sklearn.metrics import f1_score
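
The later cells call `f1_score` with `average='macro'`. As a small worked example of what that metric computes, the sketch below averages per-class F1 scores by hand and compares the result with scikit-learn. The label vectors and the helper `macro_f1_by_hand` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score


def macro_f1_by_hand(y_true, y_pred, labels):
    """Average the per-class F1 scores, giving every class the same weight."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in labels:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


# Invented labels for the three quality classes 0, 1, 2.
y_true = [0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 2, 2, 1]
by_hand = macro_f1_by_hand(y_true, y_pred, labels=[0, 1, 2])
from_sklearn = f1_score(y_true, y_pred, average="macro")
print(by_hand, from_sklearn)
```

Because every class counts equally, a rare class that is predicted poorly lowers the macro F1 as much as a common one would.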

### Data Information 

**train.csv**

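- PRODUCT_ID : unique ID of the product
- Y_Class : product quality status (Target)
  - 0 : below the acceptable range (nonconforming)
  - 1 : conforming
  - 2 : above the acceptable range (nonconforming)
- Y_Quality : quantitative measure related to product quality
- TIMESTAMP : time at which the product entered the process
- LINE : type of process LINE the product entered (one of 'T050304', 'T050307', 'T100304', 'T100306', 'T010306', 'T010305')
- PRODUCT_CODE : CODE number of the product (one of 'A_31', 'T_31', 'O_31')
- X_1 ~ X_2875 : de-identified variables extracted from the process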

**test.csv**

- PRODUCT_ID : unique ID of the product
- TIMESTAMP : time at which the product entered the process
- LINE : type of process LINE the product entered (one of 'T050304', 'T050307', 'T100304', 'T100306', 'T010306', 'T010305')
- PRODUCT_CODE : CODE number of the product (one of 'A_31', 'T_31', 'O_31')
- X_1 ~ X_2875 : de-identified variables extracted from the process


**sample_submission.csv**

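- PRODUCT_ID : unique ID of the product
- Y_Class : predicted product quality status
  - 0 : below the acceptable range (nonconforming)
  - 1 : conforming
  - 2 : above the acceptable range (nonconforming)

The data comes from an actual manufacturing process, and some variables (the X variables) were de-identified for security reasons.
'LINE' and 'PRODUCT_CODE' take the same set of values in both Train and Test.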
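
The statement that both files share the same 'LINE' and 'PRODUCT_CODE' values can be checked directly. The sketch below is one way to do that. The miniature frames and the helper `unseen_categories` are hypothetical and only reuse a few of the values listed above.

```python
import pandas as pd


def unseen_categories(train: pd.DataFrame, test: pd.DataFrame, columns):
    """Map each column to the set of values that occur in test but not in train."""
    return {col: set(test[col].unique()) - set(train[col].unique()) for col in columns}


# Hypothetical miniature frames using LINE / PRODUCT_CODE values from the description.
toy_train = pd.DataFrame({"LINE": ["T050304", "T100306"], "PRODUCT_CODE": ["A_31", "T_31"]})
toy_test = pd.DataFrame({"LINE": ["T100306", "T050304"], "PRODUCT_CODE": ["T_31", "A_31"]})

missing = unseen_categories(toy_train, toy_test, ["LINE", "PRODUCT_CODE"])
print(missing)
```

An empty set for every column means the encoder fitted on the training data will not meet an unknown label at prediction time.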

In [59]:
train_data = pd.read_csv("./data/train.csv")
train_data

Unnamed: 0,PRODUCT_ID,Y_Class,Y_Quality,TIMESTAMP,LINE,PRODUCT_CODE,X_1,X_2,X_3,X_4,...,X_2866,X_2867,X_2868,X_2869,X_2870,X_2871,X_2872,X_2873,X_2874,X_2875
0,TRAIN_000,1,0.533433,2022-06-13 5:14,T050304,A_31,,,,,...,39.34,40.89,32.56,34.09,77.77,,,,,
1,TRAIN_001,2,0.541819,2022-06-13 5:22,T050307,A_31,,,,,...,38.89,42.82,43.92,35.34,72.55,,,,,
2,TRAIN_002,1,0.531267,2022-06-13 5:30,T050304,A_31,,,,,...,39.19,36.65,42.47,36.53,78.35,,,,,
3,TRAIN_003,2,0.537325,2022-06-13 5:39,T050307,A_31,,,,,...,37.74,39.17,52.17,30.58,71.78,,,,,
4,TRAIN_004,1,0.531590,2022-06-13 5:47,T050304,A_31,,,,,...,38.70,41.89,46.93,33.09,76.97,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
593,TRAIN_593,1,0.526546,2022-09-08 14:30,T100306,T_31,2.0,95.0,0.0,45.0,...,,,,,,,,,,
594,TRAIN_594,0,0.524022,2022-09-08 22:38,T050304,A_31,,,,,...,49.47,53.07,50.89,55.10,66.49,1.0,,,,
595,TRAIN_595,0,0.521289,2022-09-08 22:47,T050304,A_31,,,,,...,,,,,,1.0,,,,
596,TRAIN_596,1,0.531375,2022-09-08 14:38,T100304,O_31,40.0,94.0,0.0,45.0,...,,,,,,,,,,


In [94]:
test_data = pd.read_csv("./data/test.csv")
test_data

Unnamed: 0,PRODUCT_ID,TIMESTAMP,LINE,PRODUCT_CODE,X_1,X_2,X_3,X_4,X_5,X_6,...,X_2866,X_2867,X_2868,X_2869,X_2870,X_2871,X_2872,X_2873,X_2874,X_2875
0,TEST_000,2022-09-09 2:01,T100306,T_31,2.0,94.0,0.0,45.0,10.0,0.0,...,,,,,,,,,,
1,TEST_001,2022-09-09 2:09,T100304,T_31,2.0,93.0,0.0,45.0,11.0,0.0,...,,,,,,,,,,
2,TEST_002,2022-09-09 8:42,T100304,T_31,2.0,95.0,0.0,45.0,11.0,0.0,...,,,,,,,,,,
3,TEST_003,2022-09-09 10:56,T010305,A_31,,,,,,,...,,,,,,,,,,
4,TEST_004,2022-09-09 11:04,T010306,A_31,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
305,TEST_305,2022-11-05 11:18,T100306,T_31,2.0,91.0,0.0,45.0,10.0,0.0,...,,,,,,,,,,
306,TEST_306,2022-11-05 16:39,T100304,T_31,2.0,96.0,0.0,45.0,11.0,0.0,...,,,,,,,,,,
307,TEST_307,2022-11-05 16:47,T100306,T_31,2.0,91.0,0.0,45.0,10.0,0.0,...,,,,,,,,,,
308,TEST_308,2022-11-05 20:53,T100306,T_31,2.0,95.0,0.0,45.0,10.0,0.0,...,,,,,,,,,,


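## 1.3 Data Preprocessing

- Preprocessing consists of handling missing values (removing them or, as here, filling them with 0), converting categorical columns to numeric values with label encoding, and normalization.
- We used the pandas isna() and sum() functions to check how many missing values each column contains.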

In [61]:
# Check missing values
train_data.isna().sum()

PRODUCT_ID      0
Y_Class         0
Y_Quality       0
TIMESTAMP       0
LINE            0
             ... 
X_2871        499
X_2872        598
X_2873        598
X_2874        598
X_2875        598
Length: 2881, dtype: int64

In [62]:
# Fill missing values with 0
train_data=train_data.fillna(0)
train_data.isna().sum()

PRODUCT_ID    0
Y_Class       0
Y_Quality     0
TIMESTAMP     0
LINE          0
             ..
X_2871        0
X_2872        0
X_2873        0
X_2874        0
X_2875        0
Length: 2881, dtype: int64
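
The earlier missing-value counts show columns such as X_2872 to X_2875 with 598 missing values out of 598 rows. Once filled with 0 these columns are constant and carry no signal. The sketch below shows one possible way to detect such columns before calling `fillna`. The toy frame and the helper `all_missing_columns` are hypothetical, and this step is not part of the original pipeline.

```python
import numpy as np
import pandas as pd


def all_missing_columns(df: pd.DataFrame) -> list:
    """Return the names of columns in which every value is NaN."""
    return df.columns[df.isna().all()].tolist()


toy = pd.DataFrame({
    "X_a": [1.0, np.nan, 3.0],        # partly missing: kept
    "X_b": [np.nan, np.nan, np.nan],  # entirely missing: dropped
})

empty_cols = all_missing_columns(toy)
cleaned = toy.drop(columns=empty_cols).fillna(0)
print(empty_cols, cleaned.shape)
```

If this were applied to the real data, the same column list would have to be dropped from the test frame so that both frames keep identical columns.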

In [63]:
print(train_data.shape)
print(train_data.columns)
print(train_data.dtypes)
print("rows, columns:", train_data.shape)

(598, 2881)
Index(['PRODUCT_ID', 'Y_Class', 'Y_Quality', 'TIMESTAMP', 'LINE',
       'PRODUCT_CODE', 'X_1', 'X_2', 'X_3', 'X_4',
       ...
       'X_2866', 'X_2867', 'X_2868', 'X_2869', 'X_2870', 'X_2871', 'X_2872',
       'X_2873', 'X_2874', 'X_2875'],
      dtype='object', length=2881)
PRODUCT_ID     object
Y_Class         int64
Y_Quality     float64
TIMESTAMP      object
LINE           object
               ...   
X_2871        float64
X_2872        float64
X_2873        float64
X_2874        float64
X_2875        float64
Length: 2881, dtype: object
rows, columns: (598, 2881)


In [95]:
from sklearn.preprocessing import LabelEncoder

# Convert the qualitative (categorical) columns to integer codes
qual_col = ['LINE', 'PRODUCT_CODE']

for i in qual_col:
    le = LabelEncoder()
    # Fit the encoder on the training values only
    le = le.fit(train_data[i])
    train_data[i] = le.transform(train_data[i])

    # Append any label that appears only in the test data so transform() does not fail
    for label in np.unique(test_data[i]):
        if label not in le.classes_:
            le.classes_ = np.append(le.classes_, label)
    test_data[i] = le.transform(test_data[i])
print('Done.')

Done.
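
The cell above extends `le.classes_` by hand so that labels seen only in the test data can still be transformed. As an alternative sketch (not what the notebook does), scikit-learn's `OrdinalEncoder` can map unknown categories to a sentinel value. The miniature frames are hypothetical, and "T999999" is an invented label used only to trigger the unknown-category path.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ["LINE", "PRODUCT_CODE"]

# Hypothetical miniature frames; "T999999" never occurs in the training frame.
toy_train = pd.DataFrame({"LINE": ["T050304", "T100306"], "PRODUCT_CODE": ["A_31", "T_31"]})
toy_test = pd.DataFrame({"LINE": ["T100306", "T999999"], "PRODUCT_CODE": ["T_31", "A_31"]})

# Unknown categories are encoded as -1 instead of raising an error at transform time.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(toy_train[cat_cols])
test_codes = encoder.transform(toy_test[cat_cols])
print(test_codes)
```

This keeps the fitted categories untouched and makes unseen labels easy to spot, at the cost of producing float codes rather than integers.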




In [65]:
train_data = train_data.drop(['PRODUCT_ID', 'TIMESTAMP','Y_Quality'],axis=1)

In [66]:
train_data

Unnamed: 0,Y_Class,LINE,PRODUCT_CODE,X_1,X_2,X_3,X_4,X_5,X_6,X_7,...,X_2866,X_2867,X_2868,X_2869,X_2870,X_2871,X_2872,X_2873,X_2874,X_2875
0,1,2,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,39.34,40.89,32.56,34.09,77.77,0.0,0.0,0.0,0.0,0.0
1,2,3,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,38.89,42.82,43.92,35.34,72.55,0.0,0.0,0.0,0.0,0.0
2,1,2,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,39.19,36.65,42.47,36.53,78.35,0.0,0.0,0.0,0.0,0.0
3,2,3,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,37.74,39.17,52.17,30.58,71.78,0.0,0.0,0.0,0.0,0.0
4,1,2,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,38.70,41.89,46.93,33.09,76.97,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
593,1,5,2,2.0,95.0,0.0,45.0,10.0,0.0,50.0,...,0.00,0.00,0.00,0.00,0.00,0.0,0.0,0.0,0.0,0.0
594,0,2,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,49.47,53.07,50.89,55.10,66.49,1.0,0.0,0.0,0.0,0.0
595,0,2,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00,0.00,0.00,0.00,0.00,1.0,0.0,0.0,0.0,0.0
596,1,4,1,40.0,94.0,0.0,45.0,11.0,0.0,45.0,...,0.00,0.00,0.00,0.00,0.00,0.0,0.0,0.0,0.0,0.0


In [67]:
train_y = train_data['Y_Class']
train_x = train_data.drop(['Y_Class'],axis=1)

In [68]:
print(train_x.shape, train_y.shape)

(598, 2877) (598,)


## 2.1.1 XGBoost

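- XGBoost is an ensemble method that belongs to the Gradient Boosting family of Boosting techniques. Because Gradient Boosting keeps fitting the remaining error, it is prone to overfitting. XGBoost, however, adds a regularization term to its objective, which helps reduce overfitting.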
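
For reference, the regularized objective mentioned here is usually written in the following form, with $T$ the number of leaves in a tree $f$ and $w$ its vector of leaf weights. The symbols are the conventional ones and do not come from this notebook.

$$
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert_2^2 + \alpha \lVert w \rVert_1
$$

In the `XGBClassifier` interface, $\gamma$, $\lambda$, and $\alpha$ correspond to the `gamma`, `reg_lambda`, and `reg_alpha` parameters, so larger values penalize trees with many leaves or large leaf weights.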

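> What is Boosting? It is an ensemble technique that focuses on misclassified data by giving it more weight.
<br/>Initially all data points have the same weight; after each round the weights and importances are recalculated, and the weight distribution is taken into account more heavily when sampling with replacement. <br/>There are many Boosting techniques; here we modeled with **XGBoost**, one of the Gradient Boosting methods.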

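>What is Gradient Boosting?<br/>In each round, a weak learner is trained to predict the per-sample errors of the current combined model. Put simply, it learns the error that can still be reduced and shrinks it step by step.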
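
The idea of fitting each new weak learner to the remaining error can be sketched in a few lines. The example below is a simplified illustration that uses squared-error regression and scikit-learn decision trees, whereas the notebook solves a multi-class problem with XGBoost's softmax objective. The function `fit_gradient_boosting` and the toy sine data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(X, y, n_rounds=20, learning_rate=0.3, max_depth=2):
    """Squared-error gradient boosting: each tree is fit to the current residuals."""
    base = float(np.mean(y))                 # round 0: constant prediction
    pred = np.full(len(y), base)
    trees, mse_history = [], []
    for _ in range(n_rounds):
        residual = y - pred                  # negative gradient of 1/2 * (y - pred)^2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
        mse_history.append(float(np.mean((y - pred) ** 2)))
    return base, trees, mse_history


# Toy regression problem: a noisy sine curve.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

base, trees, mse_history = fit_gradient_boosting(X_toy, y_toy)
print(mse_history[0], mse_history[-1])
```

The training error keeps falling with every round, which is exactly why an unregularized booster can end up memorizing the training set.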


In [69]:
xgb_model = XGBClassifier(learning_rate=0.2,
                                 n_estimators=1000,
                                 max_depth=12,
                                 min_child_weight=1,
                                 gamma=0,
                                 colsample_bytree = 0.7,
                                 subsample=0.75,
                                 objective= 'multi:softmax',
                                 nthread=-1,
                                 reg_alpha = 1e-5,
                                 seed=2011)
xgb_model.fit(train_x,train_y)
predict = xgb_model.predict(train_x )

In [70]:
f1_score(train_y, predict, average='macro')

1.0
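
The score of 1.0 above is computed on the same rows the model was fitted on, so it says little about performance on unseen products. A hedged sketch of a hold-out check with `train_test_split` follows. It uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in so that it runs without the competition files. The helper `holdout_macro_f1` is hypothetical, and any estimator with `fit` and `predict` could be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def holdout_macro_f1(model, X, y, test_size=0.2, random_state=0):
    """Fit on a stratified training split; return macro F1 on (train, validation)."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )
    model.fit(X_tr, y_tr)
    train_f1 = f1_score(y_tr, model.predict(X_tr), average="macro")
    val_f1 = f1_score(y_val, model.predict(X_val), average="macro")
    return train_f1, val_f1


# Synthetic 3-class data standing in for train_x / train_y.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)
train_f1, val_f1 = holdout_macro_f1(GradientBoostingClassifier(random_state=0), X_toy, y_toy)
print(f"train macro F1 = {train_f1:.3f}, validation macro F1 = {val_f1:.3f}")
```

A large gap between the two numbers is the usual sign of overfitting, and the validation figure is the one to compare across hyperparameter settings.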

### Step 1. Improving Accuracy
- Tune the max_depth, min_child_weight, and n_estimators hyperparameters

In [77]:
# Improve accuracy
from sklearn.model_selection import GridSearchCV

params = {
 'n_estimators':range(100,1100,100)
}
grid_xgb = GridSearchCV(estimator = XGBClassifier(learning_rate=0.2,
                                                 n_estimators=1000,
                                                 max_depth=12,
                                                 min_child_weight=1,
                                                 gamma=0,
                                                 colsample_bytree = 0.7,
                                                 subsample=0.75,
                                                 objective= 'multi:softmax',
                                                 nthread=-1,
                                                 reg_alpha = 1e-5,
                                                 seed=2011),
                        param_grid = params, n_jobs=-1)
grid_xgb.fit(train_x,train_y)
predict=grid_xgb.predict(train_x)
print(f1_score(train_y, predict, average='macro'))

1.0


In [78]:
print('best parameters : ', grid_xgb.best_params_)

best parameters :  {'n_estimators': 100}
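
`GridSearchCV` without a `scoring` argument ranks candidates by the estimator's default score (accuracy for classifiers), and the F1 printed above is measured on the training rows after refitting. The sketch below shows one way to select on cross-validated macro F1 instead. It runs on synthetic data with scikit-learn's `GradientBoostingClassifier` as a stand-in estimator, and the grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic 3-class data standing in for train_x / train_y.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=0
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [20, 40]},
    scoring="f1_macro",   # select on macro F1 rather than the default accuracy
    cv=cv,
    n_jobs=1,
)
search.fit(X_toy, y_toy)

# best_score_ is the mean cross-validated macro F1 of the best candidate.
print(search.best_params_, round(search.best_score_, 3))
```

Reading `best_score_` (or `cv_results_`) gives an out-of-fold estimate for each setting, which is more informative than a training-set score that stays at 1.0.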


In [79]:
# Improve accuracy
from sklearn.model_selection import GridSearchCV

params = {
 'max_depth':range(3,10,3),
 'min_child_weight':range(1,6,2),
}
grid_xgb = GridSearchCV(estimator = XGBClassifier(learning_rate=0.2,
                                                 n_estimators=100,
                                                 max_depth=12,
                                                 min_child_weight=1,
                                                 gamma=0,
                                                 colsample_bytree = 0.7,
                                                 subsample=0.75,
                                                 objective= 'multi:softmax',
                                                 nthread=-1,
                                                 reg_alpha = 1e-5,
                                                 seed=2011),
                        param_grid = params, n_jobs=-1)
grid_xgb.fit(train_x,train_y)
predict=grid_xgb.predict(train_x)
print(f1_score(train_y, predict, average='macro'))
print('best parameters : ', grid_xgb.best_params_)

1.0
best parameters :  {'max_depth': 3, 'min_child_weight': 5}


### Step 2
- Tune the gamma hyperparameter

In [81]:
from sklearn.model_selection import GridSearchCV

params = {
    'gamma':[i/10.0 for i in range(0,5)]
}
grid_xgb = GridSearchCV(estimator = XGBClassifier(learning_rate=0.2,
                                                 n_estimators=100,
                                                 max_depth=3,
                                                 min_child_weight=5,
                                                 gamma=0,
                                                 colsample_bytree = 0.7,
                                                 subsample=0.75,
                                                 objective= 'multi:softmax',
                                                 nthread=-1,
                                                 reg_alpha = 1e-5,
                                                 seed=2011),
                        param_grid = params, n_jobs=-1)
grid_xgb.fit(train_x,train_y)
predict=grid_xgb.predict(train_x)
print(f1_score(train_y, predict, average='macro'))
print('best parameters : ', grid_xgb.best_params_)

1.0
best parameters :  {'gamma': 0.0}


### Step 3
- Tune the colsample_bytree and subsample hyperparameters

In [82]:
from sklearn.model_selection import GridSearchCV

params = {
    'colsample_bytree':[i/10.0 for i in range(6,10)],
    'subsample':[i/100.0 for i in range(40,100)],
    
}
grid_xgb = GridSearchCV(estimator = XGBClassifier(learning_rate=0.2,
                                                 n_estimators=100,
                                                 max_depth=3,
                                                 min_child_weight=5,
                                                 gamma=0,
                                                 colsample_bytree = 0.7,
                                                 subsample=0.75,
                                                 objective= 'multi:softmax',
                                                 nthread=-1,
                                                 reg_alpha = 1e-5,
                                                 seed=2011),
                        param_grid = params, n_jobs=-1)
grid_xgb.fit(train_x,train_y)
predict=grid_xgb.predict(train_x)
print(f1_score(train_y, predict, average='macro'))
print('best parameters : ', grid_xgb.best_params_)

0.9976862401402279
best parameters :  {'colsample_bytree': 0.9, 'subsample': 0.57}


### Step 4
- Tune the regularization parameter (reg_alpha)

In [85]:
from sklearn.model_selection import GridSearchCV

params = {

    'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
}
grid_xgb = GridSearchCV(estimator = XGBClassifier(learning_rate=0.2,
                                                 n_estimators=100,
                                                 max_depth=3,
                                                 min_child_weight=5,
                                                 gamma=0,
                                                 colsample_bytree = 0.9,
                                                 subsample=0.57,
                                                 objective= 'multi:softmax',
                                                 nthread=-1,
                                                 reg_alpha = 1e-5,
                                                 seed=2011),
                        param_grid = params, n_jobs=-1)
grid_xgb.fit(train_x,train_y)
predict=grid_xgb.predict(train_x)
print(f1_score(train_y, predict, average='macro'))
print('best parameters : ', grid_xgb.best_params_)

0.9976862401402279
best parameters :  {'reg_alpha': 1e-05}


### Learning Rate
- Tune the learning_rate hyperparameter

In [86]:
from sklearn.model_selection import GridSearchCV

params = {

    'learning_rate':[i/100.0 for i in range(1,100)]
}
grid_xgb = GridSearchCV(estimator = XGBClassifier(learning_rate=0.2,
                                                 n_estimators=100,
                                                 max_depth=3,
                                                 min_child_weight=5,
                                                 gamma=0,
                                                 colsample_bytree = 0.9,
                                                 subsample=0.57,
                                                 objective= 'multi:softmax',
                                                 nthread=-1,
                                                 reg_alpha = 1e-5,
                                                 seed=2011),
                        param_grid = params, n_jobs=-1)
grid_xgb.fit(train_x,train_y)
predict=grid_xgb.predict(train_x)
print(f1_score(train_y, predict, average='macro'))
print('best parameters : ', grid_xgb.best_params_)

0.8221284968272919
best parameters :  {'learning_rate': 0.05}


In [89]:
from tqdm.notebook import tqdm
best_seed = []
for i in tqdm(range(2019)):
    # NOTE: seed is fixed at 2011 here, so every iteration trains the same model;
    # pass seed=i to make the loop variable actually change the seed.
    xgb_model=XGBClassifier(learning_rate=0.05,
                            n_estimators=100,
                            max_depth=3,
                            min_child_weight=5,
                            gamma=0,
                            colsample_bytree = 0.9,
                            subsample=0.57,
                            objective= 'multi:softmax',
                            nthread=-1,
                            reg_alpha = 1e-5,
                            seed=2011)
    xgb_model.fit(train_x,train_y)
    predict=xgb_model.predict(train_x)
    # Store (square root of the training macro F1, loop index)
    best_seed.append(((f1_score(train_y, predict, average='macro')**0.5),i))

  0%|          | 0/2019 [00:00<?, ?it/s]

In [91]:
best_seed.sort(reverse=True)
best_seed[0], best_seed[-1]

((0.9067130178988785, 2018), (0.9067130178988785, 0))
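
The best and worst entries above are identical because the loop passes `seed=2011` on every iteration, so the loop index never reaches the model. A minimal sketch of a seed comparison that does vary the seed, and scores each model with cross-validated macro F1 rather than training F1, might look like the following. It uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in, with a much smaller seed range than the original loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class data standing in for train_x / train_y.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

seed_scores = []
for seed in range(5):
    # The loop variable is passed to the model, so each iteration trains a different model.
    model = GradientBoostingClassifier(n_estimators=30, subsample=0.57, random_state=seed)
    score = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="f1_macro").mean()
    seed_scores.append((float(score), seed))

seed_scores.sort(reverse=True)
print(seed_scores[0], seed_scores[-1])
```

Differences between seeds in such a sweep mostly reflect sampling noise, so picking the single best seed tends to give an optimistic estimate.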

# 3. Creating the Submission File

In [92]:
xgb_model=XGBClassifier(learning_rate=0.05,
                        n_estimators=100,
                        max_depth=3,
                        min_child_weight=5,
                        gamma=0,
                        colsample_bytree = 0.9,
                        subsample=0.57,
                        objective= 'multi:softmax',
                        nthread=-1,
                        reg_alpha = 1e-5,
                        seed=2018)
xgb_model.fit(train_x,train_y)

In [96]:
test_data = test_data.drop(['PRODUCT_ID', 'TIMESTAMP'],axis=1)
test_data = test_data.fillna(0)
predict = xgb_model.predict(test_data)
sample_submission = pd.read_csv('./data/sample_submission.csv')
sample_submission['Y_Class'] = predict
sample_submission.to_csv('submission.csv',index = False)
sample_submission.head()

Unnamed: 0,PRODUCT_ID,Y_Class
0,TEST_000,1
1,TEST_001,1
2,TEST_002,1
3,TEST_003,1
4,TEST_004,1


In [97]:
set(sample_submission['Y_Class'])

{0, 1, 2}
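
Prediction relies on the test frame having exactly the same columns, in the same order, as the training features. As an optional safeguard (an assumption about how one might harden the pipeline, not a step from the original notebook), the sketch below aligns a test frame to the training columns with `DataFrame.reindex`. The miniature frames and the helper `align_to_training_columns` are hypothetical.

```python
import pandas as pd


def align_to_training_columns(test_df: pd.DataFrame, train_columns) -> pd.DataFrame:
    """Reorder test columns to match training; add missing ones as 0 and drop extras."""
    return test_df.reindex(columns=list(train_columns), fill_value=0)


# Hypothetical miniature frames: the test frame has a different order and an extra column.
toy_train_x = pd.DataFrame({"LINE": [0, 1], "X_1": [2.0, 0.0], "X_2": [95.0, 0.0]})
toy_test = pd.DataFrame({"X_2": [94.0], "LINE": [1], "PRODUCT_ID": ["TEST_000"]})

aligned = align_to_training_columns(toy_test, toy_train_x.columns)
print(aligned)
```

Applied to the real data, this would be called with `train_x.columns` just before `xgb_model.predict`.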