# 빅콘테스트 2020 - [챔피언리그] NS SHOP+ 판매실적 예측을 통한 편성 최적화 방안(모형) 도출
  
## 팀명: InsightOut
- 팀장 : 박민형(pminhyung12@naver.com)
- 팀원 : 김우용(zxc1843zzz@naver.com), 박서희(seohuipark95@gmail.com), 문찬호(buddy6274@naver.com)
  
## 코드 순서
- 패키지 및 라이브러리 Import
- 함수생성 - 데이터 불러오기 및 병합
- 함수실행 - 데이터 불러오기 및 병합
- 모델적합을 위한 **변수선택과 파이프라인 함수 생성**
- KNN Imputation
- 상품명(pdname) 벡터화
- 학습을 위한 최종데이터셋 정의
- 하이퍼파라미터 최적화 - 베이지안 최적화
- 모델학습 - **LightGBM**
- 제출파일 생성 - (2020. 06) 테스트 데이터 예측

### 패키지 및 라이브러리 Import

In [207]:
!pip install soynlp
!pip install lightgbm
!pip install bayesian-optimization



In [1]:
### [데이터 전처리를 위한 라이브러리 및 모듈]
import os
import string
import re
from functools import reduce
from datetime import datetime, timedelta, date
import warnings

import numpy as np
import pandas as pd
import soynlp
from soynlp.tokenizer import RegexTokenizer
warnings.filterwarnings('ignore')

### [머신러닝을 위한 라이브러리 및 모듈]
import sklearn
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder, OrdinalEncoder
import lightgbm
from lightgbm import LGBMRegressor

from bayes_opt import BayesianOptimization

### [서드파티 패키지 버전확인]
print('numpy:', np.__version__)
print('pandas:', pd.__version__)
print('soynlp:', soynlp.__version__)
print('sklearn:', sklearn.__version__)
print('lightgbm:', lightgbm.__version__)

numpy: 1.18.2
pandas: 1.0.3
soynlp: 0.0.493
sklearn: 0.22.2.post1
lightgbm: 2.3.1


### 함수생성 - 데이터 불러오기 및 병합

In [2]:
### [2019, 2020 공휴일 리스트]
hday_19 = [i.replace(' ','-') for i in ['2019 01 01', '2019 02 04', '2019 02 05', '2019 02 06', '2019 03 01', '2019 05 06', '2019 06 06',
'2019 08 15', '2019 09 12', '2019 08 13', '2019 10 03', '2019 10 09', '2019 12 25', '2020 01 01',]]

hday_20 = [i.replace(' ','-') for i in ['2020 06 06']]
### //

### [함수 생성 - 데이터 적재]
def load_NS():
    train = pd.read_excel('train/sales_train.xlsx', skiprows=1)
    test = pd.read_excel('test/sales_test.xlsx', skiprows=1)
    return train, test


def load_categories(train=True):     # 상품카테고리 및 네이버데이터랩 데이터
    if train:
        cat = pd.read_csv('train_category_datalab.csv')[['ratio', 'sub_cate']]
        cat.columns = ['ratio', 'sub_cate']
        return cat
    else:
        cat = pd.read_csv('test_category_datalab.csv')[['ratio', '세분류']]
        cat.columns = ['ratio', 'sub_cate']
        return cat

    
def load_weather(train=True):        # 기상 데이터
    if train:
        return pd.read_csv('2019_weather_dust.csv')
    else:
        return pd.read_csv('2020_weather_dust.csv')

    
def load_kosis(path, ind_col):       # KOSIS 데이터
    df = pd.read_excel(path, index_col = ind_col).T.reset_index()
    df['index'] = df['index'].apply(lambda x: x.replace('. ', '-'))
    df.columns = [col_name.strip() for col_name in df.columns]
    return df


def load_parcel():                   # 배송 데이터
    parcel = pd.read_csv('delivery.csv')[['period', 'delivery']]
    parcel.columns = ['dt_YMD', 'parcel']
    return parcel


def load_mask():                     # 마스크 데이터
    mask = pd.read_csv('mask.csv')[['period', 'ratio']]
    mask.columns = ['dt_YMD', 'mask']
    return mask

### //


### [함수 생성 - 필요 변수 생성]
def preprocess_NS(df, hday):
    df.columns = ['datetime', 'duration', 'mthcode', 'pdcode', 'pdname', 'pdgroup','unitp', 'sales']

    
    # [날짜시간 관련 변수]
    t = ['mon', 'tue', 'wed', 'thur', 'fri', 'sat', 'sun']
    df['datetime'] = pd.to_datetime(df['datetime'])
    df['mth'] = df['datetime'].dt.month
    df['day'] = df['datetime'].dt.day
    df['hour'] = df['datetime'].dt.hour
    df['minute'] = df['datetime'].dt.minute
    df['dt_YMD'] = df['datetime'].apply(lambda x: str(x).split()[0])
    df['wday_num'] = df['datetime'].apply(lambda x: x.weekday())
    df['wday'] = df['datetime'].apply(lambda x: t[x.weekday()])
    df['hday'] = df['wday'].apply(lambda x: 0)
    df.loc[(df['dt_YMD'].isin(hday)) | (df['wday'].isin(['sat', 'sun'])), 'hday'] = 1
    df['hour_168'] = df['wday_num']*24 + df['hour']
    df['week_52'] = df['datetime'].apply(lambda x: date(x.year, x.month, x.day).isocalendar()[1] if x < pd.to_datetime('2019-12-30') else 53)
    
    
    # [프라임타임 binning 변수]
    df['prime']=0
    week_prime_morn = [9, 10, 11]
    week_prime_aft = [16, 17]
    week_prime_even = [20, 21, 22]
    week_not_prime = [1,2,3,4,5,6,7,8,12,13,14,15,18,19,23,24]
    wknd_prime_morn = [8,9,10,11]
    wknd_prime_aft = [13,14,15,16,17]
    wknd_prime_even = [21,22]
    wknd_not_prime = [1,2,3,4,5,6,7,12,18,19,20,23,24]
    
    df.loc[(df['hday']==0)&(df['hour'].isin(week_prime_morn)), 'prime'] = 'week_prime_morn'
    df.loc[(df['hday']==0)&(df['hour'].isin(week_prime_aft)), 'prime'] = 'week_prime_aft'
    df.loc[(df['hday']==0)&(df['hour'].isin(week_prime_even)), 'prime'] = 'week_prime_even'
    df.loc[(df['hday']==0)&(df['hour'].isin(week_not_prime)), 'prime'] = 'week_not_prime'
    df.loc[(df['hday']==1)&(df['hour'].isin(wknd_prime_morn)), 'prime'] = 'wknd_prime_morn'
    df.loc[(df['hday']==1)&(df['hour'].isin(wknd_prime_aft)), 'prime'] = 'wknd_prime_aft'
    df.loc[(df['hday']==1)&(df['hour'].isin(wknd_prime_even)), 'prime'] = 'wknd_prime_even'
    df.loc[(df['hday']==1)&(df['hour'].isin(wknd_not_prime)), 'prime'] = 'wknd_not_prime'
    
    
    # [단위가격 binning 변수] 
    bin_divider=[0, df.unitp.quantile(.25), df.unitp.quantile(.5),\
                 df.unitp.quantile(.75), df.unitp.max()]
                 
    bin_names=['low_price','mid_low_price','mid_high_price','high_price']

    df['unitp_bin']=pd.cut(x=df['unitp'], 
                   bins=bin_divider, 
                   labels=bin_names, 
                   include_lowest=True) 
    
    df['orders']=df['sales']/df['unitp']
    
    
    # [zapping-time 관련 변수]
    duration_df = df.pivot_table(index = ['datetime'], values = 'duration', aggfunc = 'max').reset_index()
    duration_df.columns = ['datetime', 'duration_filled']
    df = pd.merge(df, duration_df, on = 'datetime')
    df['mid_time'] = df['datetime'] + df['duration_filled'].apply(lambda x: timedelta(minutes = x//2))
    df['mid_hour'] = df['mid_time'].dt.hour
    df['mid_minute'] = df['mid_time'].dt.minute
    df['zap_hour'] = (df['mid_time'] + df['mid_minute'].apply(lambda x: timedelta(hours=x//30))).dt.hour
    df['wday_num'] = df['datetime'].apply(lambda x: x.weekday())
    
    
    # [상품명 관련 변수] 
    df['pdname_token'] = df['pdname'].apply(lambda x: [i for i in RegexTokenizer().tokenize(x) if i not in string.punctuation])
    df['sex'] = df['pdname_token'].apply(lambda x: 'male' if ('남성' in x) or ('남아' in x) else ('female' if ('여성' in x ) or ('여아' in x) else 'unisex'))
    df['payment'] = df['pdname_token'].apply(lambda x: 'muisa' if ('무이자' in x) or ('무' in x) else ('ilsibul' if ('일시불' in x) or ('일' in x) else 'pay'))
    df['bargain'] = df['pdname'].apply(lambda x: 'bargain' if ('1+1' in x) or ('파격가' in x) or ('특가' in x) or ('인하' in x) or ('세일' in x) else 'same')
    df['junior'] = df['pdname'].apply(lambda x: 1 if ('주니어' in x) or ('아동' in x) else 0)
    j = re.compile('[\d]+종')
    p = re.compile('[\d]+팩')
    g = re.compile('[\d]+구')
    m = re.compile('[\d]+미')
    iy = re.compile('[\d]+인용')
    bx = re.compile('[\d]+박스')
    df['jong'] = df['pdname'].apply(lambda x: j.findall(x)[0].replace('종', '') if len(j.findall(x))!=0 else
                                    (p.findall(x)[0].replace('팩', '') if len(p.findall(x))!=0 else 
                                     (g.findall(x)[0].replace('구', '') if len(g.findall(x))!=0 else
                                      (m.findall(x)[0].replace('미', '') if len(m.findall(x))!=0 else
                                       (iy.findall(x)[0].replace('인용', '') if len(iy.findall(x))!=0 else
                                        (bx.findall(x)[0].replace('박스', '') if len(bx.findall(x))!=0 else 0
                                        )
                                       )
                                      )
                                     )
                                    )
                                   )
    

    # [방송파트 변수] 
    df['part'] = 0
    df['part_pk'] = df['dt_YMD'] + df['pdcode'].apply(lambda x: str(x))

    for pk in df['part_pk'].unique():
        partpk_df = df[df['part_pk']==pk]
        num = len(partpk_df)
        df.loc[df['part_pk']==pk, 'part'] = [i/num for i in range(1, num+1)]
        

    # [Outlier 처리를 위한 다양한 변수 생성]
    # junior, puma, dickies, uspa, season, yeonggwang, AAC, mask_pdn, hour_rank, underwear_sex
    df['junior'] = df['pdname'].apply(lambda x: 1 if ('주니어' in x) or ('아동' in x) else 0)
    df['puma'] = df['pdname'].apply(lambda x: 1 if ('푸마' in x) else 0)
    df['dickies'] = df['pdname'].apply(lambda x: 1 if ('디키즈' in x) else 0)
    df['uspa'] = df['pdname'].apply(lambda x: 1 if ('USPA' in x) else 0)

    df['season'] = df['mth'].apply(lambda x : 'winter' if ((x==1) or (x==2) or (x==12)) else ('spring' if ((x==3) or (x==4) or (x==5)) else
                                                                                                           ('summer' if ((x==6) or (x==7) or (x==8)) else
                                                                                                            ('fall' if ((x==9) or (x==10) or (x==11)) else 0)
                                                                                                           )))
    df['yeonggwang'] = df['pdname'].apply(lambda x:1 if '영광' in x  else 0)
    df['AAC'] = df['pdname'].apply(lambda x:1 if 'AAC' in x  else 0)
    df['mask_pdn'] = df['pdname'].apply(lambda x:1 if '마스크' in x  else 0)

    
    h_hour_168 ={}
    w_hour_168 ={}
    
    for i in range(2):
        hour_rank = 0
        dic={}
        for hour168 in df[df['hday']==i].groupby(['hour_168']).aggregate(np.mean).sort_values(by = 'orders').index.tolist():
            hour_rank+=1
            dic[hour168] = hour_rank
        if i==0:
            w_hour_168 = dic
        else:
            h_hour_168 = dic


    def input_hour_rank(row_hday, row_hour_168):
        if row_hday==0:
            return w_hour_168[row_hour_168]
        else:
            return h_hour_168[row_hour_168]

    df['hour_rank'] = df.apply(lambda row: input_hour_rank(row.hday, row.hour_168), axis = 1)


    def input_underwear_sex(row_pdgroup, row_pdname):
        if (row_pdgroup=='속옷') & (('드로즈' in row_pdname) or ('트렁크' in row_pdname) or ('남성' in row_pdname) or ('남아' in row_pdname)):
            return 1
        elif (row_pdgroup=='속옷') & (('드로즈' not in row_pdname) and ('트렁크' not in row_pdname) and ('남성' not in row_pdname) and ('남아' not in row_pdname)):
            return 2
        else:
            return 0

    df['underwear_sex'] = df.apply(lambda row: input_underwear_sex(row.pdgroup, row.pdname), axis = 1)
    

    # [KOSIS 데이터병합을 위한 컬럼]
    df['index'] = df['datetime'].apply(lambda x: str(x)[:7])
    

    return df

###//

### [함수생성 = 데이터 병합] 
def merge_all(df, train = True):
    
    
    # [상품카테고리, 네이버데이터랩 병합] 
    if train:
        cat = load_categories(train=True)
    else:
        cat = load_categories(train=False)
        
    df = pd.concat([df, cat], axis = 1)
    
        
    # ["무형" 데이터 제거] 
    df = df[df['pdgroup']!='무형']
    
    
    # [KOSIS 데이터 병합] 
    sales_type_df = load_kosis('KOSIS_소매업태별_판매액.xlsx', '업태별')
    retail_food_df = load_kosis('KOSIS_소매음식판매액지수_불변지수.xlsx', '업종별')
    ind_type_df = load_kosis('KOSIS_소매업태별_판매액지수_불변지수.xlsx', '업태별')
    ind_pd_kosis_df = load_kosis('KOSIS_업태별상품군_판매액지수_불변지수.xlsx', '업태별상품군').drop(['무점포 소매 총지수'], axis = 1)
    
    sales_type_df.columns = ['index', 'kos_nostore_sales']
    retail_food_df.columns = ['index', 'kos_retail_sales_ind', 'kos_catering']
    ind_type_df.columns = ['index', 'kos_online_shop', 'kos_homeshop']
    
    kosis_df_li = [df, sales_type_df, retail_food_df, ind_type_df] 
    df = reduce(lambda  left, right: pd.merge(left, right, on=['index']), kosis_df_li)
    
    def input_ind_pdtype_kosis(row): 
        cat_dict = {'의류': '의복', '속옷': '의복', '주방': '기타상품', '농수축': '음식료품', '이미용': '화장품', '가전': '가전제품', '생활용품': '기타상품',
                   '건강기능': '음식료품', '잡화': '기타상품', '가구': '가구', '침구': '기타상품'}
        pdgroup = row[df.columns.tolist().index('pdgroup')]
        date = row[df.columns.tolist().index('index')]
        row[df.columns.tolist().index('kos_pd_sales_ind')] = ind_pd_kosis_df[ind_pd_kosis_df['index']==date][cat_dict[pdgroup]].values[0]
        return row
    
    df['kos_pd_sales_ind'] = [0]*len(df) 
    df = df.apply(input_ind_pdtype_kosis, axis = 1) 
    
    
    # [기상 데이터 병합]
    if train:
        wther = load_weather(train = True)
    else:
        wther = load_weather(train = False)
    
    df = pd.merge(df, wther, left_on = 'dt_YMD', right_on = 'date')
    

    # [마스크, 배송 데이터 병합]
    parcel = load_parcel()
    mask = load_mask()
    
    df = pd.merge(df, parcel, on = 'dt_YMD')
    df = pd.merge(df, mask, on = 'dt_YMD')

    return df

### //

### 함수실행 - 데이터 불러오기 및 병합
- train: 2019. 01 ~ 2019. 12.
- test : 2020. 06

In [3]:
# NSSHOP 제공데이터 load
ns_train, ns_test = load_NS()

# NSSHOP 제공데이터 preprocessing
pp_train, pp_test = preprocess_NS(ns_train, hday_19), preprocess_NS(ns_test, hday_20)

In [4]:
# 추가데이터셋과 NSSHOP 제공데이터 병합, Train, Test set 준비
train_data = merge_all(pp_train, train = True)
test_data = merge_all(pp_test, train = False) 
print('train_data.shape:', train_data.shape, 'test_data.shape', test_data.shape)

display(train_data.head(2))
display(test_data.head(2))

train_data.shape: (37368, 98) test_data.shape (2708, 98)


Unnamed: 0,datetime,duration,mthcode,pdcode,pdname,pdgroup,unitp,sales,mth,day,...,temp_gwangju,maxtemp_gwangju,mintemp_gwangju,dailycross_gwangju,precip_gwangju,humid_gwangju,cloud_gwangju,finedust,parcel,mask
0,2019-01-01 06:00:00,20.0,100346,201072,테이트 남성 셀린니트3종,의류,39900,2099000.0,1,1,...,0.0,2.4,-2.3,4.7,0.0,65,7.9,38,3.0452,0.04095
1,2019-01-01 06:00:00,,100346,201079,테이트 여성 셀린니트3종,의류,39900,4371000.0,1,1,...,0.0,2.4,-2.3,4.7,0.0,65,7.9,38,3.0452,0.04095


Unnamed: 0,datetime,duration,mthcode,pdcode,pdname,pdgroup,unitp,sales,mth,day,...,temp_seoul,maxtemp_seoul,mintemp_seoul,dailycross_seoul,precip_seoul,humid_seoul,cloud_seoul,finedust,parcel,mask
0,2020-06-01 06:20:00,20.0,100650,201971,잭필드 남성 반팔셔츠 4종,의류,59800,,6,1,...,19.7,24.5,16.6,7.9,0.4,64,3.8,19,6.51407,1.85467
1,2020-06-01 06:40:00,20.0,100650,201971,잭필드 남성 반팔셔츠 4종,의류,59800,,6,1,...,19.7,24.5,16.6,7.9,0.4,64,3.8,19,6.51407,1.85467


### 모델적합을 위한 **변수선택과 파이프라인 함수 생성**

In [5]:
use_features = ['unitp', 'jong', 'ratio',       # numerical
                'kos_nostore_sales', 'kos_retail_sales_ind', 'kos_catering', 'kos_online_shop', 'kos_homeshop', 'kos_pd_sales_ind', 
                'parcel', 'mask', 
                'maxtemp_seoul', 'mintemp_seoul',  'precip_seoul', 'humid_seoul','dailycross_seoul','cloud_seoul', 'finedust', 
                'mthcode', 'pdcode', 'pdgroup', # categorical(one-hot)
                'mth', 'day', 'hour', 'minute', 'wday', 'hday', 'hour_168', 'week_52', 'prime', 'mid_hour', 'mid_minute', 'zap_hour',
                'sex', 'payment', 'bargain', 'sub_cate', 
                'part', 'junior', 'puma', 'dickies', 'uspa', 'season', 'yeonggwang', 'AAC', 'mask_pdn', 'hour_rank', 'underwear_sex',
                'unitp_bin',                    # categorical(ordinal)
                'sales']                        # target
    
numeric_features = ['unitp', 'jong', 'ratio', 
                   'kos_nostore_sales', 'kos_retail_sales_ind', 'kos_catering', 'kos_online_shop','kos_homeshop', 'kos_pd_sales_ind',
                    'parcel', 'mask',
                    'maxtemp_seoul', 'mintemp_seoul', 'dailycross_seoul', 'precip_seoul', 'humid_seoul', 'cloud_seoul', 'finedust']

cat_ohe_features = ['junior', 'puma', 'dickies', 'uspa',
                     'mthcode', 'pdcode', 'pdgroup', 
                    'mth', 'day', 'hour', 'minute', 'wday', 'hday', 
                      'hour_168', 'week_52', 'prime',
                      'mid_hour', 'mid_minute', 'zap_hour',
                    'sex', 'payment', 'bargain', 'sub_cate', 
                   'part' , 'season', 'yeonggwang', 'AAC', 'mask_pdn','hour_rank', 'underwear_sex']

cat_ord_features = ['unitp_bin']

def astype_cols(df):
    """ 변수 데이터타입 변환 후 반환 """
    for numft in numeric_features:
        df[numft] = df[numft].astype(np.float64)
    for catoheft in cat_ohe_features:
        df[catoheft] = df[catoheft].astype(str)
    for catordft in cat_ord_features:
        df[catordft] = df[catordft].astype(str)
    return df

def get_pipeline(use_features, numeric_features, cat_ohe_features, cat_ord_features):
    """ Pipeline 반환 """

    prep_pipe = Pipeline(steps=[('preprocessor', ColumnTransformer(
                                                transformers=[
                                                ('num', StandardScaler(), numeric_features),
                                                ('cat_ohe',  OneHotEncoder(handle_unknown='ignore'), cat_ohe_features),
                                                ('cat_lbl', OrdinalEncoder(), cat_ord_features)])
                                )])
    return prep_pipe

###  KNN Imputation

In [6]:
# KNN
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

knn = KNeighborsRegressor(4, weights = 'distance')

knn_pipe = get_pipeline(use_features, numeric_features, cat_ohe_features, cat_ord_features)

# KNN - X, y
X = train_data[use_features].dropna().drop(['sales'], axis = 1).reset_index(drop=True)
y = train_data[use_features].dropna()['sales'].reset_index(drop=True)

knnx = knn_pipe.fit_transform(astype_cols(X))
knny = y

# KNN - 학습
knn.fit(knnx, np.log1p(knny))

# KNN - 예측(매출액 결측치 대입)
df_missing_sales = train_data[train_data['sales'].isnull()][use_features].drop(['sales'], axis = 1)
df_missing_sales = astype_cols(df_missing_sales)

imputed_sales = np.exp(knn.predict(knn_pipe.transform(df_missing_sales)))
imputed_train_data = train_data[use_features].reset_index(drop=True)
imputed_train_data.loc[imputed_train_data['sales'].isnull(),'sales'] = imputed_sales

imputed_sales

array([ 6644000.09216122, 10574483.20619655,  9543920.6064579 , ...,
        7203683.23936971,  5510860.64105574,  6351388.23719007])

### 상품명(pdname) 벡터화

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

tokens= [li for li in (train_data.pdname_token.tolist() + test_data.pdname_token.tolist())]
print('(train + test) 상품명 data 개수 :',len(tokens))

tfidf = TfidfVectorizer(
    analyzer='word', lowercase = False, preprocessor= lambda x: ' '.join(x),
    max_features = None, min_df = 10)

name_tf = tfidf.fit_transform(tokens)
name_tf.toarray().shape

(train + test) 상품명 data 개수 : 40076


(40076, 1600)

### 학습을 위한 최종데이터셋 정의

In [8]:
### [사용변수에서 상품코드와 마더코드 제외]
use_features = [feature for feature in use_features if (feature!='pdcode') & (feature!='mthcode')]
cat_ohe_features = [feature for feature in cat_ohe_features if (feature!='pdcode') & (feature!='mthcode')]


### [데이터셋 X 변수들과  y(매출액) 정의]
train_X = astype_cols(imputed_train_data[use_features].drop(['sales'], axis = 1).reset_index(drop=True))
train_y = imputed_train_data['sales'].reset_index(drop=True)

test_X = astype_cols(test_data[use_features].drop(['sales'], axis = 1))


### [데이터셋 X 에 pipeline 실행]
prep_pipe = get_pipeline(use_features, numeric_features, cat_ohe_features, cat_ord_features)
prep_pipe.fit(pd.concat([train_X, test_X], axis = 0))

train_X_pipe = prep_pipe.transform(train_X)
test_X_pipe = prep_pipe.transform(test_X)


### [데이터셋 X에 상품명벡터 병합]
train_X_all = np.concatenate((train_X_pipe.toarray(), name_tf.toarray()[:train_X_pipe.shape[0],:]), axis =1)
test_X_all = np.concatenate((test_X_pipe.toarray(), name_tf.toarray()[train_X_pipe.shape[0]:,:]), axis =1)

print('train_X_all shape :', train_X_all.shape, 'train_y shape :', train_y.shape)
print('test_X_all shape :', test_X_all.shape)

train_X_all shape : (37368, 2614) train_y shape : (37368,)
test_X_all shape : (2708, 2614)


### 하이퍼파라미터 최적화 - 베이지안 최적화

In [9]:
def bayesion_opt_lgbm(X, y, init_iter=3, n_iters=7, random_state=11, seed = 101, num_iterations = 100):
    dtrain = lightgbm.Dataset(data=X, label=y)
  
    def hyp_lgbm(num_leaves, max_depth, min_split_gain, min_child_weight, \
                 learning_rate, colsample_bytree, subsample, n_estimators):

        params = {'application':'regression','num_iterations': num_iterations,
                  'early_stopping_round': 50,
                'metric':'rmse'} # Default parameters
        params["num_leaves"] = int(round(num_leaves))
        params['max_depth'] = int(round(max_depth))
        params['min_split_gain'] = min_split_gain
        params['min_child_weight'] = int(min_child_weight)
        params['learning_rate'] = learning_rate,
        params['colsample_bytree']= colsample_bytree
        params['subsample']= subsample
        params['n_estimators']= int(n_estimators)
        
        cv_results = lightgbm.cv(params, dtrain, nfold = 5, seed = seed, categorical_feature=[], stratified=False,
                          verbose_eval =None)
        return  (-1.0 * np.array(cv_results['rmse-mean'])).max()
    

    pds = {'num_leaves': (31, 100),
            'max_depth': (5, 12),
            'min_split_gain': (0.001, 0.1),
            'min_child_weight': (10, 25),
           'learning_rate':(0.2, 0.4),
           'min_child_weight':(1, 8),
           'colsample_bytree':(0.3, 0.8),
           'subsample': (0.8, 1),
           'n_estimators': (500, 1000)
            }

    optimizer = BayesianOptimization(hyp_lgbm, pds, random_state=random_state)
    optimizer.maximize(init_points=init_iter, n_iter=n_iters, acq = 'ei')
    print(optimizer.max['params'])
    return optimizer.res

res = bayesion_opt_lgbm(train_X_all, np.log1p(train_y), init_iter=5, n_iters=25, random_state=77, seed = 101, num_iterations = 200)

|   iter    |  target   | colsam... | learni... | max_depth | min_ch... | min_sp... | n_esti... | num_le... | subsample |
-------------------------------------------------------------------------------------------------------------------------
| [0m 1       [0m | [0m-0.407   [0m | [0m 0.7596  [0m | [0m 0.3284  [0m | [0m 10.28   [0m | [0m 1.975   [0m | [0m 0.009645[0m | [0m 894.0   [0m | [0m 53.5    [0m | [0m 0.9082  [0m |
| [0m 2       [0m | [0m-0.4235  [0m | [0m 0.4201  [0m | [0m 0.3091  [0m | [0m 7.804   [0m | [0m 6.006   [0m | [0m 0.08383 [0m | [0m 794.2   [0m | [0m 51.43   [0m | [0m 0.8562  [0m |
| [0m 3       [0m | [0m-0.4264  [0m | [0m 0.6528  [0m | [0m 0.2845  [0m | [0m 5.401   [0m | [0m 6.229   [0m | [0m 0.04578 [0m | [0m 587.9   [0m | [0m 34.41   [0m | [0m 0.8585  [0m |
| [0m 4       [0m | [0m-0.4243  [0m | [0m 0.3334  [0m | [0m 0.3502  [0m | [0m 5.446   [0m | [0m 4.023   [0m | [0m 0.03705 [0m | [0m 576

### 모델학습 - **LightGBM**

In [10]:
### [모델 학습]
model = LGBMRegressor(n_estimators = 900, 
                      max_depth = 10, 
                      learning_rate= 0.32,
                      colsample_bytree = 0.7,
                     gamma = 1
                     )

model.fit(train_X_all, np.log1p(train_y))

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=0.7,
              gamma=1, importance_type='split', learning_rate=0.32,
              max_depth=10, min_child_samples=20, min_child_weight=0.001,
              min_split_gain=0.0, n_estimators=900, n_jobs=-1, num_leaves=31,
              objective=None, random_state=None, reg_alpha=0.0, reg_lambda=0.0,
              silent=True, subsample=1.0, subsample_for_bin=200000,
              subsample_freq=0)

### 제출파일 생성 - (2020. 06) 테스트 데이터 예측

In [11]:
test_pred = np.expm1(model.predict(test_X_all))
print('test data 개수:', test_data.shape[0], '\ntest 예측 data 개수:', test_pred.shape[0])

test data 개수: 2708 
test 예측 data 개수: 2708


In [12]:
test_data['sales'] = test_pred

In [13]:
submission = test_data[['datetime', 'duration', 'mthcode', 'pdcode', 'pdname', 'pdgroup', 'unitp', 'sales']]
submission.columns = ['방송일시', '노출(분)', '마더코드', '상품코드', '상품명', '상품군', '판매단가', '취급액']
submission.to_excel('데이터분석분야_챔피언리그_InsightOut_평가데이터답안파일.xlsx', index=False)