# ML Pipeline

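* From now on, keep two questions in mind whenever you write code.
    * How should I write this so that it can be reused?
    * How should I write this so that the pipeline flows smoothly from one step to the next?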

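* You are a data analyst at OO Telecom.
* The company wants to address customers who port their number to another carrier (churn) once their contract period ends.
* It has therefore asked you to analyze which customers churn.
* Let's find the factors that affect customer churn (CHURN).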

![](https://d18lkz4dllo6v2.cloudfront.net/cumulus_uploads/entry/23964/mobile%20phones.png)

|Variable|Description|Role|
|----|----|----|
|COLLEGE|College graduate or not (1, 0); categorical||
|INCOME|Annual income (USD)||
|OVERAGE|Monthly overage usage (minutes)||
|LEFTOVER|Monthly share of unused minutes (%)||
|HOUSE|House price (USD)||
|HANDSET_PRICE|Handset price (USD)||
|OVER_15MINS_CALLS_PER_MONTH|Average number of long calls (15 minutes or longer)||
|AVERAGE_CALL_DURATION|Average call duration (minutes)||
|REPORTED_SATISFACTION|Satisfaction survey ('very_unsat', 'unsat', 'avg', 'sat', 'very_sat'); categorical||
|REPORTED_USAGE_LEVEL|Usage level survey ('very_little', 'little', 'avg', 'high', 'very_high'); categorical||
|CONSIDERING_CHANGE_OF_PLAN|Plan change survey ('never_thought', 'no', 'perhaps', 'considering', 'actively_looking_into_it'); categorical||
|**CHURN**|Churn (1: churned, 0: stayed)|**Target**|


## 0. Environment Setup

### 1) Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import MinMaxScaler

from sklearn.svm import SVC
from sklearn.metrics import classification_report

### 2) Loading the Data

In [2]:
use_cols = ['COLLEGE', 'INCOME', 'OVERAGE', 'LEFTOVER', 'HOUSE', 'HANDSET_PRICE', 'OVER_15MINS_CALLS_PER_MONTH', 'AVERAGE_CALL_DURATION',
            'REPORTED_SATISFACTION', 'REPORTED_USAGE_LEVEL', 'CONSIDERING_CHANGE_OF_PLAN', 'CHURN']
data = pd.read_csv('data/mobile.csv', usecols = use_cols )
data.head()

Unnamed: 0,COLLEGE,INCOME,OVERAGE,LEFTOVER,HOUSE,HANDSET_PRICE,OVER_15MINS_CALLS_PER_MONTH,AVERAGE_CALL_DURATION,REPORTED_SATISFACTION,REPORTED_USAGE_LEVEL,CONSIDERING_CHANGE_OF_PLAN,CHURN
0,1,47711,183,17,730589.0,192,19,5,unsat,little,considering,0
1,0,74132,191,43,535092.0,349,15,2,unsat,very_little,no,1
2,1,150419,0,14,204004.0,682,0,6,unsat,very_high,considering,0
3,0,159567,0,58,281969.0,634,1,1,very_unsat,very_high,never_thought,0
4,1,23392,0,0,216707.0,233,0,15,unsat,very_little,no,1


## 2. Data Preprocessing

### 1) Handling Unnecessary Data
It is best to load only the columns you actually need from the start.

### 2) Data Splitting

#### Splitting x and y

In [3]:
target = 'CHURN'
x = data.drop(target, axis=1)
y = data[target]

#### Test split

Here we hold out only a small portion (15%) as the test set.

In [4]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15, random_state=2022)

#### Train/validation split

In [5]:
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=2022)
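
Because the second split is taken from what remains after the first, the final proportions are roughly 68% train, 17% validation and 15% test. The sketch below repeats the two calls on a hypothetical 1,000-row frame to make that arithmetic visible. The toy data and row count are assumptions for illustration, not the course data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows, one feature, a binary target
x = pd.DataFrame({'feature': np.arange(1000)})
y = pd.Series(np.arange(1000) % 2)

# Step 1: hold out 15% of all rows as the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15, random_state=2022)

# Step 2: take 20% of the remaining 85% as the validation set
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=2022)

# Share of the original data that ends up in each split
sizes = {'train': len(x_train), 'val': len(x_val), 'test': len(x_test)}
shares = {name: n / len(x) for name, n in sizes.items()}
print(shares)
```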

### 3) Feature Engineering

### 4) Handling NaN ①

* First, check x for NaN values.

In [6]:
x_train.isna().sum()

COLLEGE                         0
INCOME                          0
OVERAGE                         0
LEFTOVER                        0
HOUSE                           6
HANDSET_PRICE                   0
OVER_15MINS_CALLS_PER_MONTH     0
AVERAGE_CALL_DURATION           0
REPORTED_SATISFACTION          30
REPORTED_USAGE_LEVEL            0
CONSIDERING_CHANGE_OF_PLAN      0
dtype: int64

#### SimpleImputer 

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

In [7]:
from sklearn.impute import SimpleImputer

imputer1_list = ['REPORTED_SATISFACTION']

imputer1 = SimpleImputer(strategy='most_frequent')

x_train[imputer1_list] = imputer1.fit_transform(x_train[imputer1_list])

In [8]:
x_train.isna().sum()

COLLEGE                        0
INCOME                         0
OVERAGE                        0
LEFTOVER                       0
HOUSE                          6
HANDSET_PRICE                  0
OVER_15MINS_CALLS_PER_MONTH    0
AVERAGE_CALL_DURATION          0
REPORTED_SATISFACTION          0
REPORTED_USAGE_LEVEL           0
CONSIDERING_CHANGE_OF_PLAN     0
dtype: int64

#### Applying to the validation set

In [9]:
# Reuse the value learned from the training set: transform only, do not refit
x_val[imputer1_list] = imputer1.transform(x_val[imputer1_list])

In [10]:
x_val.isna().sum()

COLLEGE                        0
INCOME                         0
OVERAGE                        0
LEFTOVER                       0
HOUSE                          3
HANDSET_PRICE                  0
OVER_15MINS_CALLS_PER_MONTH    0
AVERAGE_CALL_DURATION          0
REPORTED_SATISFACTION          0
REPORTED_USAGE_LEVEL           0
CONSIDERING_CHANGE_OF_PLAN     0
dtype: int64
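
When an imputer is applied to validation or new data, `transform` reuses the value learned from the training set, whereas `fit_transform` relearns it from the data being filled. The sketch below shows the difference on two tiny hypothetical frames (the values are made up for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: the most frequent value differs between train and validation
train = pd.DataFrame({'REPORTED_SATISFACTION': pd.Series(['unsat', 'unsat', 'sat', np.nan], dtype=object)})
val = pd.DataFrame({'REPORTED_SATISFACTION': pd.Series(['sat', 'sat', np.nan], dtype=object)})

# Fit on the training set only, then reuse the learned mode on validation
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(train)
val_reused = imputer.transform(val)

# Refitting on validation learns a different fill value
val_refit = SimpleImputer(strategy='most_frequent').fit_transform(val)

print(imputer.statistics_)   # mode learned from the training set
print(val_reused[2, 0])      # filled with the training mode
print(val_refit[2, 0])       # filled with the validation mode
```

Refitting makes the preprocessing depend on whichever batch happens to arrive, so the model would see differently prepared inputs than it was trained on.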

### 5) Dummy Variable Encoding

In [11]:
cat = {'REPORTED_SATISFACTION':['very_unsat', 'unsat', 'avg', 'sat', 'very_sat'],
       'REPORTED_USAGE_LEVEL':['very_little', 'little', 'avg', 'high', 'very_high'],
       'CONSIDERING_CHANGE_OF_PLAN':['never_thought', 'no', 'perhaps', 'considering', 'actively_looking_into_it']}

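def mobile_to_categorical(df, cat) :
    # Work on a copy so the caller's DataFrame is not modified
    tmp = df.copy()
    # Fix the category levels of each column so every dataset yields the same dummy columns
    for k, v in cat.items() :
        tmp[k] = pd.Categorical(tmp[k], categories=v, ordered=False)
    # drop_first=True drops the first level of each column as the baseline
    tmp = pd.get_dummies(tmp, columns=list(cat), drop_first=True)
    return tmp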

In [12]:
x_train = mobile_to_categorical(x_train, cat)

#### Applying to the validation set

In [13]:
x_val = mobile_to_categorical(x_val, cat)
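
Declaring the category levels with `pd.Categorical` before calling `pd.get_dummies` matters most for small batches of new data, where some levels may not appear at all. A minimal sketch with a hypothetical two-row frame:

```python
import pandas as pd

levels = ['very_little', 'little', 'avg', 'high', 'very_high']

# Hypothetical new data in which only two of the five levels occur
new = pd.DataFrame({'REPORTED_USAGE_LEVEL': ['little', 'avg']})

# Without declared levels, the columns depend on the values that happen to be present
plain = pd.get_dummies(new, columns=['REPORTED_USAGE_LEVEL'], drop_first=True)

# With declared levels, the columns are always the same four (first level dropped)
fixed = new.copy()
fixed['REPORTED_USAGE_LEVEL'] = pd.Categorical(fixed['REPORTED_USAGE_LEVEL'], categories=levels)
fixed = pd.get_dummies(fixed, columns=['REPORTED_USAGE_LEVEL'], drop_first=True)

print(list(plain.columns))
print(list(fixed.columns))
```

A scaler or model fitted on the training columns expects exactly that column set, so the fixed-level version is the one that can be fed to them safely.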

### 6) Scaling


In [14]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

x_train_s = scaler.fit_transform(x_train)

#### Applying to the validation set

In [15]:
x_val_s = scaler.transform(x_val)
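
`MinMaxScaler` maps each column with `(x - min) / (max - min)`, where `min` and `max` come from the data it was fitted on. Two consequences are worth seeing on a small hypothetical column: values outside the training range fall outside [0, 1], and NaN values are ignored during fitting and passed through by `transform`, which is why scaling can run before the remaining NaN values are imputed.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training column with one missing value: min=100, max=300
train = np.array([[100.0], [200.0], [np.nan], [300.0]])
# Hypothetical validation column: one value inside, one outside the training range, one missing
val = np.array([[150.0], [400.0], [np.nan]])

scaler = MinMaxScaler()
scaler.fit(train)                # NaN is disregarded when computing min and max
scaled = scaler.transform(val)   # (150-100)/200 = 0.25, (400-100)/200 = 1.5, NaN stays NaN

print(scaled.ravel())
```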

### 7) Handling NaN ②

#### KNNImputer
https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html

In [16]:
from sklearn.impute import KNNImputer

col_list = list(x_train)

imputer2 = KNNImputer()
x_train_s = imputer2.fit_transform(x_train_s)

x_train_s = pd.DataFrame(x_train_s, columns=col_list)

In [17]:
x_train_s.isna().sum()

COLLEGE                                                0
INCOME                                                 0
OVERAGE                                                0
LEFTOVER                                               0
HOUSE                                                  0
HANDSET_PRICE                                          0
OVER_15MINS_CALLS_PER_MONTH                            0
AVERAGE_CALL_DURATION                                  0
REPORTED_SATISFACTION_unsat                            0
REPORTED_SATISFACTION_avg                              0
REPORTED_SATISFACTION_sat                              0
REPORTED_SATISFACTION_very_sat                         0
REPORTED_USAGE_LEVEL_little                            0
REPORTED_USAGE_LEVEL_avg                               0
REPORTED_USAGE_LEVEL_high                              0
REPORTED_USAGE_LEVEL_very_high                         0
CONSIDERING_CHANGE_OF_PLAN_no                          0
CONSIDERING_CHANGE_OF_PLAN_perh

#### Applying to the validation set

In [18]:
x_val_s = imputer2.transform(x_val_s)

x_val_s = pd.DataFrame(x_val_s, columns=col_list)
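
`KNNImputer` fills a missing value from the rows that are closest on the features that are present, which is why it runs on the scaled data: without scaling, large-valued columns such as HOUSE would dominate the distance. The sketch below uses a hypothetical four-row array and `n_neighbors=1` so the result is easy to follow. The notebook itself uses the default `n_neighbors=5`, which averages the five nearest rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical scaled data; the last row is missing its second feature
X = np.array([
    [0.0, 0.0],
    [0.1, 0.2],
    [0.9, 1.0],
    [1.0, np.nan],
])

# With one neighbour, the gap is filled from the single closest row
imputer = KNNImputer(n_neighbors=1)
filled = imputer.fit_transform(X)

# The last row is closest to [0.9, 1.0] on the first feature, so it borrows 1.0
print(filled[3])
```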

## 3. Modeling


In [19]:
from sklearn.svm import SVC

model = SVC()
model.fit(x_train_s, y_train)

SVC()

In [20]:
from sklearn.metrics import classification_report

y_pred = model.predict(x_val_s)
print(classification_report(y_val, y_pred))

              precision    recall  f1-score   support

           0       0.64      0.73      0.68      1511
           1       0.69      0.59      0.64      1515

    accuracy                           0.66      3026
   macro avg       0.67      0.66      0.66      3026
weighted avg       0.67      0.66      0.66      3026



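## 4. Organizing the Data Pipeline

* Suppose the best model has now been built and deployed to the production system.
* When new data arrives in production, what steps should the pipeline run, and in what order?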

In [21]:
# new_data : x_test

x_test.head()

Unnamed: 0,COLLEGE,INCOME,OVERAGE,LEFTOVER,HOUSE,HANDSET_PRICE,OVER_15MINS_CALLS_PER_MONTH,AVERAGE_CALL_DURATION,REPORTED_SATISFACTION,REPORTED_USAGE_LEVEL,CONSIDERING_CHANGE_OF_PLAN
17631,1,27296,104,18,293757.0,181,5,5,very_sat,little,no
548,1,89370,78,13,266922.0,344,4,5,unsat,avg,perhaps
9178,1,39363,0,23,704168.0,217,1,4,very_unsat,little,no
17249,0,151891,0,14,829930.0,508,0,5,very_unsat,very_little,considering
5057,0,94895,74,77,727593.0,307,3,4,unsat,little,considering


### 1) Collecting the code from the [Applying to the validation set] steps

* Declare the functions and variables

In [22]:
def mobile_to_categorical(df, cat) :
    tmp = df.copy()
    for k, v in cat.items() :
        tmp[k] = pd.Categorical(tmp[k], categories=v, ordered=False)
    tmp = pd.get_dummies(tmp, columns=cat.keys(), drop_first=True)
    return tmp

col_list = list(x_train)

imputer1_list = ['REPORTED_SATISFACTION']

cat = {'REPORTED_SATISFACTION':['very_unsat', 'unsat', 'avg', 'sat', 'very_sat'],
       'REPORTED_USAGE_LEVEL':['very_little', 'little', 'avg', 'high', 'very_high'],
       'CONSIDERING_CHANGE_OF_PLAN':['never_thought', 'no', 'perhaps', 'considering', 'actively_looking_into_it']}

* Run the preprocessing

In [23]:
tmp = x_test.copy()

In [24]:
# SimpleImputer
tmp[imputer1_list] = imputer1.transform(tmp[imputer1_list])

# Dummy variable encoding
tmp = mobile_to_categorical(tmp, cat)

# Scaling
tmp = scaler.transform(tmp)

# KNNImputer
tmp = imputer2.transform(tmp)

# Convert back to a DataFrame
tmp = pd.DataFrame(tmp, columns=col_list)

tmp.head()

Unnamed: 0,COLLEGE,INCOME,OVERAGE,LEFTOVER,HOUSE,HANDSET_PRICE,OVER_15MINS_CALLS_PER_MONTH,AVERAGE_CALL_DURATION,REPORTED_SATISFACTION_unsat,REPORTED_SATISFACTION_avg,REPORTED_SATISFACTION_sat,REPORTED_SATISFACTION_very_sat,REPORTED_USAGE_LEVEL_little,REPORTED_USAGE_LEVEL_avg,REPORTED_USAGE_LEVEL_high,REPORTED_USAGE_LEVEL_very_high,CONSIDERING_CHANGE_OF_PLAN_no,CONSIDERING_CHANGE_OF_PLAN_perhaps,CONSIDERING_CHANGE_OF_PLAN_considering,CONSIDERING_CHANGE_OF_PLAN_actively_looking_into_it
0,1.0,0.052084,0.31454,0.202247,0.169083,0.06632,0.172414,0.285714,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,1.0,0.495638,0.237389,0.146067,0.137506,0.278283,0.137931,0.285714,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2,1.0,0.13831,0.005935,0.258427,0.652017,0.113134,0.034483,0.214286,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.942385,0.005935,0.157303,0.800003,0.491547,0.0,0.285714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.535117,0.225519,0.865169,0.679582,0.230169,0.103448,0.214286,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


### 2) Building and Running a Data Pipeline Function

In [25]:
def mobile_datapipeline(df, simpleimputer, simpleimputer_list, dump_list, scaler, knnimputer) :
    tmp = df.copy()

    # SimpleImputer (already fitted on the training set)
    tmp[simpleimputer_list] = simpleimputer.transform(tmp[simpleimputer_list])

    # Dummy variable encoding
    tmp = mobile_to_categorical(tmp, dump_list)

    x_col = list(tmp)

    # Scaling
    tmp = scaler.transform(tmp)

    # KNNImputer
    tmp = knnimputer.transform(tmp)

    # Convert back to a DataFrame
    return pd.DataFrame(tmp, columns=x_col)

In [26]:
input_data = mobile_datapipeline(x_test, imputer1, imputer1_list, cat, scaler, imputer2).head()

In [27]:
model.predict(input_data)

array([1, 0, 0, 0, 0], dtype=int64)
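
scikit-learn also provides `sklearn.pipeline.Pipeline`, which chains fitted steps into a single object with one `fit` and one `predict`. It is an alternative to the hand-written function above, not what this notebook uses. The sketch covers only the numeric steps (scaling, KNN imputation, SVC) and runs on synthetic data generated for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer
from sklearn.svm import SVC

# Synthetic data (assumption): 200 rows, 3 numeric features, a few NaN values
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
X[::25, 1] = np.nan

# Same order as the notebook: scale first, then impute, then classify
pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('knnimputer', KNNImputer()),
    ('model', SVC()),
])

# fit() fits every step on the training rows only
pipe.fit(X[:150], y[:150])

# predict() applies transform() of each step to new rows before calling the model
pred = pipe.predict(X[150:])
print(pred[:5])
```

Because the whole chain is one object, a single `joblib.dump(pipe, ...)` would store all the fitted steps together.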

## 5. Saving Python Objects

In [28]:
import joblib

In [34]:
use_cols = ['COLLEGE', 'INCOME', 'OVERAGE', 'LEFTOVER', 'HOUSE', 'HANDSET_PRICE', 'OVER_15MINS_CALLS_PER_MONTH', 'AVERAGE_CALL_DURATION',
            'REPORTED_SATISFACTION', 'REPORTED_USAGE_LEVEL', 'CONSIDERING_CHANGE_OF_PLAN']
# Save the variables
joblib.dump(imputer1_list, 'preprocess/simpleimputer_list.plk')
joblib.dump(cat, 'preprocess/dumm_list.plk')
joblib.dump(use_cols, 'preprocess/use_cols.plk')

['preprocess/use_cols.plk']

In [30]:
# Save the preprocessing objects
joblib.dump(imputer1, 'preprocess/simpleimputer.plk')
joblib.dump(scaler, 'preprocess/minmaxscaler.plk')
joblib.dump(imputer2, 'preprocess/knnimputer.plk')

['preprocess/knnimputer.plk']

In [31]:
# Save the model
joblib.dump(model, 'model/model.plk')

['model/model.plk']
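
Objects written with `joblib.dump` are restored with `joblib.load`, and a restored preprocessing object behaves like the original fitted one. The sketch below round-trips a small scaler through a temporary directory rather than the `preprocess/` folder used above. The toy values are assumptions for illustration. Note that `joblib.dump` does not create missing folders, so the target directory has to exist first.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A small fitted object to save (hypothetical values: min=0, max=10)
scaler = MinMaxScaler().fit(np.array([[0.0], [10.0]]))

with tempfile.TemporaryDirectory() as base_dir:
    # Create the target folder before dumping into it
    target_dir = os.path.join(base_dir, 'preprocess')
    os.makedirs(target_dir, exist_ok=True)

    path = os.path.join(target_dir, 'minmaxscaler.plk')
    saved_files = joblib.dump(scaler, path)   # returns the list of files written
    restored = joblib.load(path)

# The restored scaler carries the fitted min and max
result = restored.transform(np.array([[5.0]]))
print(result)
```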

In [32]:
import datetime
# For model versioning
now = datetime.datetime.now()
timestamp = now.strftime('%Y%m%d_%H%M%S')

joblib.dump(imputer1, 'preprocess/simpleimputer_'+timestamp+'.plk')
joblib.dump(scaler, 'preprocess/minmaxscaler_'+timestamp+'.plk')
joblib.dump(imputer2, 'preprocess/knnimputer'+timestamp+'.plk')
joblib.dump(model, 'model/model'+timestamp+'.plk')

['model/model20220401_110330.plk']
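
A useful property of the `%Y%m%d_%H%M%S` format is that the names sort chronologically as plain strings, so the newest version can be picked without parsing dates. In the sketch below, only the first file name comes from the output above; the other two are hypothetical.

```python
import datetime

# File names sharing one prefix and the '%Y%m%d_%H%M%S' timestamp format
names = [
    'model20220401_110330.plk',   # from the output above
    'model20220315_090000.plk',   # hypothetical older version
    'model20211231_235959.plk',   # hypothetical older version
]

# Fixed-width, year-first timestamps sort in time order as strings
latest = max(names)
print(latest)

# Recover the timestamp from the name if the actual datetime is needed
stamp = latest[len('model'):-len('.plk')]
saved_at = datetime.datetime.strptime(stamp, '%Y%m%d_%H%M%S')
print(saved_at)
```

This only works when every name uses the same prefix and separator, so it is worth keeping the naming pattern identical across all saved objects.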

In [None]:
from sklearn.metrics import accuracy_score