--- 

># **Code Description**

---

- File name :  <br>
- Start date :  <br>
- Last modified :  <br>
- Author : 김혁진 <br>
- Topic :  <br>

--- 

- **References**

  (1) Competition homepage : [Dacon](https://dacon.io/competitions/official/235848/overview/description) <br>
  (2) Hyperparameter explanation : [Naver Blog](https://blog.naver.com/wideeyed/221333529176) <br>
  (3) Python class explanation : [Github](https://zzsza.github.io/development/2020/07/05/python-class/) <br>
  (4) GPU setup : [Medium](https://medium.com/@am.sharma/lgbm-on-colab-with-gpu-c1c09e83f2af) <br>
  (5) Session crash caused by exhausting RAM : [Tistory](https://somjang.tistory.com/entry/Google-Colab-%EC%9E%90%EC%A3%BC%EB%81%8A%EA%B8%B0%EB%8A%94-%EB%9F%B0%ED%83%80%EC%9E%84-%EB%B0%A9%EC%A7%80%ED%95%98%EA%B8%B0)

---

- **Considerations** <br>
  (1) Try generating derived features with an AutoEncoder <br>
  (2) Hyperparameter search : grid-search, bayesian-optimization, [optuna](https://dacon.io/competitions/official/235713/codeshare/2704?page=1&dtype=recent)

---

># **Basic Setup**

<br></br>
## Markdown : Tabular Left Align

In [1]:
%%html
<style>
    table {float:left}
</style>

<br></br>
## Jupyter Notebook Style : Theme, Display

In [2]:
# # install the theme package
# !pip install jupyterthemes

# # upgrade jupyter notebook to the latest version
# !pip install --upgrade notebook

# # upgrade jupyterthemes to the latest version
# !pip install --upgrade jupyterthemes

# 2.2.1. Change the theme (customizing)
# !jt -t onedork -fs 115 -nfs 125 -tfs 115 -dfs 115 -ofs 115 -cursc r -cellw 80% -lineh 115 -altmd  -kl -T -N

# 2.2.2. Use a wider Jupyter Notebook layout
# Source: https://taehooh.tistory.com/entry/Jupyter-Notebook-주피터노트북-화면-넓게-쓰는방법
# from IPython.core.display import display, HTML 
# display(HTML("<style>.container { width:80% !important; }</style>"))

# # 2.2.3. Build a table of contents (TOC) in the left sidebar
# # Source : https://gmnam.tistory.com/246
# pip install jupyter_nbextensions_configurator
# pip install jupyter_contrib_nbextensions

# jupyter nbextensions_configurator enable --user
# jupyter contrib nbextension install --user

In [3]:
# # 2.3.1 Google Drive Mount
# # (only needed when using Google Drive)
# from google.colab import drive
# drive.mount('/content/drive', force_remount = True) # an authorization key must be obtained in a new window and entered

# # 2.3.2. Memory error
# https://growingsaja.tistory.com/477

In [4]:
# # 2.3.3. Use GPU (takes about 6 minutes)
# !git clone --recursive https://github.com/Microsoft/LightGBM
# !mkdir build
# %cd /content/LightGBM
# !cmake -DUSE_GPU=1 #avoid ..
# !make -j$(nproc)
# !sudo apt-get -y install python-pip
# !sudo -H pip install setuptools pandas numpy scipy scikit-learn -U
# %cd /content/LightGBM/python-package

<br></br>
##### Install Modules

In [5]:
# !pip uninstall pandas -y
# !pip uninstall numpy  -y
# !pip uninstall lightgbm -y

# !pip install pandas==1.1.0
# !pip install numpy==1.21.2
# !pip install -U scikit-learn
# !pip install lightgbm --install-option=--gpu

# !pip install pandasql
# !pip install seaborn
# !pip install plotnine

# lightgbm raised an error when installed with pip; installing it through conda resolved it
# conda install -c conda-forge lightgbm 

# install bayesian optimization
# !pip install bayesian-optimization

<br></br>
## Import Modules

In [6]:
# for jupyter notebook only
from tqdm.notebook import tqdm
# from tqdm import tqdm

# basic modules
import pandas as pd
import numpy as np
import math
import warnings
import random
import os
import time

# a more general-purpose version of value_counts()
from collections import Counter as cnt


# plotting
import seaborn as sns
sns.set(rc={'figure.figsize':(11.7, 8.27)})
sns.set_style('whitegrid')

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [11.7, 8.27] # [15, 10] # [11.7,8.27] - A4 size

from plotnine import *


# sqldf
from pandasql import sqldf
sql = lambda q: sqldf(q, globals())


# modeling
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import cross_val_score,StratifiedKFold
from sklearn.metrics import f1_score

# import lightgbm
# !pip install lightgbm --install-option=--gpu --install-option="--opencl-include-dir=/usr/local/cuda/include/" --install-option="--opencl-library=/usr/local/cuda/lib64/libOpenCL.so"
import lightgbm as lgb
from lightgbm import LGBMClassifier


# Hyperparameter Optimization
from bayes_opt import BayesianOptimization

<br></br>
## Initial Values

In [12]:
# 2.4.1. Data Path
# jupyter.notebook : 'os.getcwd() + '/DAT/블랙 프라이데이 판매 예측/''
# google.colab     : '/content/drive/MyDrive/Python/4. 블랙프라이데이 판매예측/DAT/'
DATA_PATH = os.getcwd() + '/DAT/2. 심장 질환 예측 경진대회 (Dacon)/'
OUT_PATH  = os.getcwd() + '/OUT/2. 심장 질환 예측 경진대회 (Dacon)/'

# 2.4.2. set seed
SEED = 777

# 2.4.3. plot
PLOT = True

# 2.4.4. scaling
SCALE = True

# 2.4.5. interaction term
INTERACTION = True # True

# initial value save
ini_var = ['SEED','PLOT','SCALE','INTERACTION']

<br></br>
## Set Off the Warning

In [8]:
pd.set_option('mode.chained_assignment', None)
warnings.filterwarnings(action='ignore')

<br></br>
## User Defined Function

In [13]:
#-------------------------------------------------------------------------------------------------------#
# 2.6.1. Seed Fix
#-------------------------------------------------------------------------------------------------------#
def seed_everything(seed: int = 1):
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # torch.manual_seed(seed)
    # torch.cuda.manual_seed(seed)  # type: ignore
    # torch.backends.cudnn.deterministic = True  # type: ignore
    # torch.backends.cudnn.benchmark = True  # type: ignore
    
seed_everything(SEED)

#-------------------------------------------------------------------------------------------------------#
# 2.6.2. View all columns
#-------------------------------------------------------------------------------------------------------#
def View(data):

    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
    
    print(data)

    pd.set_option('display.max_rows', 0)
    pd.set_option('display.max_columns', 0)
    pd.set_option('display.width', 0)

#-------------------------------------------------------------------------------------------------------#
# 2.6.3. minmax function
#-------------------------------------------------------------------------------------------------------#
def minmax(x):
    return min(x),max(x)

#-------------------------------------------------------------------------------------------------------#
# 2.6.4. Remove the target key from the column dict
#-------------------------------------------------------------------------------------------------------#
# - dict : source dict
# - key  : key to delete
#-------------------------------------------------------------------------------------------------------#
def rmkey(dict, key):
    tmp = dict.copy()
    del tmp[key]
    return tmp

#-------------------------------------------------------------------------------------------------------#
# 2.6.5. Count the missing values in each column
#-------------------------------------------------------------------------------------------------------#
# - data     : source data
# - col_type : dictionary of {column name : type}
#-------------------------------------------------------------------------------------------------------#
def missing_column_check(data, col_type):
    n_na = []
    n_na_type = []
    for col_nm in data.columns:
        data[col_nm] = data[col_nm].astype(col_type[col_nm])

        # for str columns, also check for blank (whitespace-only) values
        if col_type[col_nm]==str:

            isnull_cnt = data[col_nm].str.strip().isnull().sum()
            blank_cnt  = sum(data[col_nm].str.strip()=='')
            nan_cnt    = sum(data[col_nm].str.strip()=='nan')
            null_cnt   = sum(data[col_nm].str.strip()=='null')

            n_na_x = isnull_cnt+blank_cnt+nan_cnt+null_cnt
            n_na.append(n_na_x)
            
            if n_na_x>0:
                n_na_type_x=[]
                if isnull_cnt>0: n_na_type_x.append('isnull')
                if blank_cnt >0: n_na_type_x.append('blank')
                if nan_cnt   >0: n_na_type_x.append('nan')
                if null_cnt  >0: n_na_type_x.append('null')
                n_na_type_x = '+'.join(n_na_type_x)
            else:
                n_na_type_x = ''
            n_na_type.append(n_na_type_x)
            

        # for numeric columns, only count the nulls
        else:
            n_na_x = data[col_nm].isnull().sum()
            n_na.append(n_na_x)
            
            if n_na_x>0:
                n_na_type.append('isnull')
            else:
                n_na_type.append('')
            
    res_df = pd.DataFrame({
        'col'  : data.columns,
        'n_na' : n_na,
        'n_na_ratio' : [str(round(n/len(data)*100,1))+'%' for n in n_na],
        'na_type' : n_na_type,
        'col_type' : [col_type[col].__name__ for col in data.columns]  # use the argument, not the global COL_TYPE
        })

    res_df = res_df[res_df['n_na']>0]
    if len(res_df)==0:
        return('Dataset does not have a null value')
    else:
        return(res_df)

#-------------------------------------------------------------------------------------------------------#
# 2.6.6. Add interaction terms
#-------------------------------------------------------------------------------------------------------#
# - data     : source data
# - num_vari : list of numeric variables
#-------------------------------------------------------------------------------------------------------#
def interaction_term(data,num_vari):

    num_var = list(set(num_vari) - set(['id']))

    for i in range(0,len(num_var)):
        for j in range(i,len(num_var)):
            data[f'{num_var[i]}*{num_var[j]}'] = data[f'{num_var[i]}']*data[f'{num_var[j]}']

    return(data)

#-------------------------------------------------------------------------------------------------------#
# 2.6.7. color when print
#-------------------------------------------------------------------------------------------------------#
class color:
    PURPLE    = '\033[95m'
    CYAN      = '\033[96m'
    DARKCYAN  = '\033[36m'
    BLUE      = '\033[94m'
    GREEN     = '\033[92m'
    YELLOW    = '\033[93m'
    RED       = '\033[91m'
    BOLD      = '\033[1m'
    UNDERLINE = '\033[4m'
    END       = '\033[0m'

#-------------------------------------------------------------------------------------------------------#
# 2.6.8. density plot : histogram + density plot
#-------------------------------------------------------------------------------------------------------#
# - data : source data
# - vars : numeric variables to draw hist + kde for
# - hue  : grouping variable
# - binwidth_adj_ratio : ratio used to adjust the binwidth
#-------------------------------------------------------------------------------------------------------#
def density_plot(data, vars, 
                 binwidths = None, hue = None,
                 binwidth_adj_ratio = None):

    from matplotlib.ticker import PercentFormatter

    # 1) A single variable passed as a str would raise an error below
    #    -> one variable  : type = str
    #    -> two or more   : type = ndarray, set, ...
    if isinstance(vars, str):
        vars = [vars]
    else:
        vars = list(vars)  # sets cannot be used as a DataFrame indexer
    
    # 2) set the plotting grid (nrow, ncol)
    nrow = math.ceil(len(vars)**(1/2))
    ncol = nrow

    # 3) derive binwidth when binwidths is not given (Sturges' rule)
    # Source : http://www.aistudy.co.kr/paper/pdf/histogram_jeon.pdf
    if binwidths is None:
        binwidths = []
        for col in data[vars].columns:
            n_bin = math.ceil(1 + 3.32*math.log10(len(data)))
            # use the data argument for both max and min (the original read min from the global train)
            binwidth = ( data[col].max() - data[col].min() ) / n_bin
            binwidths.append(binwidth)
            del binwidth
    
    # 4) ratio used to adjust the derived binwidth
    if binwidth_adj_ratio is not None:
        binwidths = [binwidth * binwidth_adj_ratio for binwidth in binwidths]
    
    fig = plt.figure()
    
    # 5) create one plot per variable
    for iter,var in enumerate(vars):
        
        binwidth = binwidths[iter]
        
        # (1) histogram
        ax1 = fig.add_subplot(nrow, ncol, iter+1)
        g1 = sns.histplot(data = data, x = var, hue = hue,
                          kde = True, stat = 'probability', 
                          color = 'lightskyblue',
                          binwidth = binwidth, ax = ax1)
        ax2 = ax1.twinx()
        
        # (2) density plot
        g2 = sns.kdeplot(data = data, x = var, hue = hue,
                         color = 'red', lw = 2, ax = ax2)
        ax2.set_ylim(0, ax1.get_ylim()[1] / binwidth)                  # similar limits on the y-axis to align the plots
        #ax2.yaxis.set_major_formatter(PercentFormatter(1 / binwidth))  # show axis such that 1/binwidth corresponds to 100%
        ax2.grid(False)
        
        # (3) hide the y-axis of the density plot
        g2.set(yticklabels=[]) 
        g2.set(ylabel=None)
        g2.tick_params(right=False)
        
        a,b = divmod(iter,ncol)
        if b!=0:
            g1.set(ylabel=None)
        
    # keep the subplots from overlapping
    fig.tight_layout()
    plt.show()

# example : density_plot(train, vars=num_vari)

#-------------------------------------------------------------------------------------------------------#
# 2.6.9. plot_num : numeric variables by group
#
# (1) bar chart of grp_var vs hue_var
# (2) for each num_var, by grp_var (x-axis) and hue_var : barplot, violinplot, box+swarmplot, kdeplot
#-------------------------------------------------------------------------------------------------------#
#- grp_var : grouping variable shown on the x-axis (text)
#- num_vari : numeric variables (list)
#- data : source data
#- title_text : plot title (text)
#- hue_var : hue grouping variable
#-------------------------------------------------------------------------------------------------------#
def plot_num(grp_var, num_vari, data, title_text=None, hue_var=None):
    
    # settings for plot (1)
    fig0 = plt.figure(figsize=(3,3))
    ax0  = fig0.add_subplot(1,1,1)
    
    if title_text is None:
        title_text = grp_var
        
    plt.title(title_text, loc='left', pad=20, fontdict={'fontsize' : 30,
                                                        'fontweight' : 'bold',
                                                        'color' : 'c'})
    
    # when grp_var and hue_var are the same, do not split by hue
    if (grp_var!=hue_var) and (hue_var is not None):

        ct = pd.crosstab(data[grp_var],data[hue_var])
        ax = ct.plot(kind='bar', stacked=False, rot=0, ax=ax0)
        ax.legend(title=hue_var, bbox_to_anchor=(1, 1.02), loc='upper left')
        
    else:
        ct = data[grp_var].value_counts()
        ax = ct.plot(kind='bar', stacked=False, rot=0, ax=ax0)
        
    # show
    plt.xlabel('')
    plt.show()
    
    # exclude the [grp, id] variables from the numeric variables
    num_vari_x = list(set(num_vari) - set([grp_var,'id']))
    
    # create the figure
    fig = plt.figure(figsize=(15,15)) # figsize=(15,7)
    plt.axis('off') # if not turned off, a 0~1 axis appears on the x-axis
    
    for iter,var in enumerate(num_vari_x):

        # do not pass hue when it is the same as grp_var
        hue_x = None if grp_var==hue_var else hue_var

        # (n,4) plot
        ax1 = fig.add_subplot(len(num_vari_x),4,4*iter+1)
        ax2 = fig.add_subplot(len(num_vari_x),4,4*iter+2)
        ax3 = fig.add_subplot(len(num_vari_x),4,4*iter+3)
        ax4 = fig.add_subplot(len(num_vari_x),4,4*iter+4)

        #---------------------------------------------------------------------------------------------
        # (2-1) 3rd panel : box + swarm plot (run first so its ylim can be reused)
        #---------------------------------------------------------------------------------------------
        g11=sns.swarmplot(x=grp_var, y=var, data=data, ax = ax3, color='crimson', marker='*', s = 7)
        g12=sns.boxplot  (x=grp_var, y=var, data=data, ax = ax3)
        g12.set(ylabel=None)
        g12.set(yticklabels=[])

        #---------------------------------------------------------------------------------------------
        # (2-2) 1st panel : barplot
        #---------------------------------------------------------------------------------------------
        ax1.set_ylim(ax3.get_ylim())
        g21=sns.barplot(x=grp_var, y=var, data=data, ax=ax1, hue=hue_x)
        # g21.set(ylabel=None)
        # g21.set(yticklabels=[])
        # g21.axes.set_title(str(iter+1) + ':' + var, fontsize=20, weight='bold', ha='left', x=-.05)
        g21.set_ylabel(var,fontsize=20)

        #---------------------------------------------------------------------------------------------
        # (2-3) 2nd panel : violinplot
        #---------------------------------------------------------------------------------------------
        ax2.set_ylim(ax3.get_ylim())
        g31=sns.violinplot(x=grp_var, y=var, data=data, ax=ax2, legend=False, hue=hue_x)
        g31.set(ylabel=None)
        g31.set(yticklabels=[])

        #---------------------------------------------------------------------------------------------
        # (2-4) 4th panel : density plot
        #---------------------------------------------------------------------------------------------
        ax4.set_ylim(ax3.get_ylim())
        g41=sns.kdeplot(y=var, hue=grp_var, data=data, ax=ax4)
        g41.set(ylabel=None)
        g41.set(yticklabels=[])
        g41.tick_params(right=False)
        g41.set(xlabel=None)
        g41.set(xticklabels=[])
        
        # show the x-axis only on the bottom row
        if (iter+1) != len(num_vari_x):
            
            g12.set(xlabel=None)
            g12.set(xticklabels=[])

            g21.set(xlabel=None)
            g21.set(xticklabels=[])

            g31.set(xlabel=None)
            g31.set(xticklabels=[])

    fig.tight_layout()
    plt.show()

# # example
# plot_num(grp_var = 'sex', num_vari = num_vari, hue_var = 'target',
#          data = train, title_text = 'sex')
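
When `binwidths` is not supplied, `density_plot` derives the number of bins as `ceil(1 + 3.32 * log10(n))` (Sturges' rule) and divides the value range by it. The arithmetic can be sketched on its own as follows. The helper name `sturges_binwidth` and the sample values are illustrative assumptions, not part of the notebook.

```python
import math

def sturges_binwidth(values):
    """Return (n_bin, binwidth) using the same formula as density_plot."""
    n_bin = math.ceil(1 + 3.32 * math.log10(len(values)))
    binwidth = (max(values) - min(values)) / n_bin
    return n_bin, binwidth

# Illustrative values only (not taken from the competition data).
ages = list(range(30, 80))  # 50 observations, range = 49
n_bin, binwidth = sturges_binwidth(ages)
print(n_bin, binwidth)
```

Passing `binwidth_adj_ratio = 0.8`, as the EDA cells do, would then simply multiply the resulting width by 0.8.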

<br></br>
## Version Check

In [None]:
import sys
print(sys.version)

<br></br>
<br></br>
># **Data**

<br></br>

## Variable Information (variable names from : [Dacon](https://dacon.io/competitions/official/235848/data))

| Variable | Description | Encoding | Details |
|:---:|:---|:---|:---|
| id | Unique row id | | |
| age | Age | | |
| sex | Sex | female = 0, male = 1 | |
| cp | Chest pain type | asymptomatic = 0, atypical angina = 1, non-anginal pain = 2, typical angina = 3 | |
| trestbps | Resting blood pressure (mmHg) | | resting blood pressure |
| chol | Blood cholesterol (mg/dl) | | serum cholesterol |
| fbs | Fasting blood sugar | 120 mg/dl or below = 0, above = 1 | fasting blood sugar |
| restecg | Resting electrocardiogram result | probable or definite left ventricular hypertrophy = 0, normal = 1, having ST-T wave abnormality = 2 | resting electrocardiographic |
| thalach | Maximum heart rate | | maximum heart rate achieved |
| exang | Exercise-induced angina | no = 0, yes = 1 | exercise induced angina |
| oldpeak | ST depression induced by exercise relative to rest | | ST depression induced by exercise relative to rest |
| slope | Slope of the peak exercise ST segment | downsloping = 0, flat = 1, upsloping = 2 | the slope of the peak exercise ST segment |
| ca | Number of major vessels seen by fluoroscopy | 0~3, <strong style="color:red">Null is encoded as 4</strong> | number of major vessels colored by fluoroscopy |
| thal | Thalassemia status | normal = 1, fixed defect = 2, reversible defect = 3, <strong style="color:red">Null is encoded as 0</strong> | thalassemia |
| target | Heart disease diagnosis | vessel diameter narrowing below 50% = 0, 50% or more = 1 | |

<br></br>
## Data Load

In [19]:
COL_TYPE = {
    'id'       : int,
    'age'      : int,
    'sex'      : str,    # 0,1
    'cp'       : str,    # 0,1,2,3
    'trestbps' : int,
    'chol'     : int,
    'fbs'      : str,    # 0,1
    'restecg'  : str,    # 0,1,2
    'thalach'  : int,
    'exang'    : str,    # 0,1
    'oldpeak'  : float,
    'slope'    : str,    # 0,1,2
    'ca'       : str,    # 0~3, with Null=4
    'thal'     : str,    # 1,2,3, with Null=0
    'target'   : str,    # 0,1
}

# Load the train, test and sample submission data
train = pd.read_csv(DATA_PATH + 'train.csv', dtype = COL_TYPE)
test  = pd.read_csv(DATA_PATH + 'test.csv', dtype = COL_TYPE)
sub   = pd.read_csv(DATA_PATH + 'sample_submission.csv', dtype = COL_TYPE)

train

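| | id | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 0 | 1 | 53 | 1 | 2 | 130 | 197 | 1 | 0 | 152 | 0 | 1.2 | 0 | 0 | 2 | 1 |
| 1 | 2 | 52 | 1 | 3 | 152 | 298 | 1 | 1 | 178 | 0 | 1.2 | 1 | 0 | 3 | 1 |
| 2 | 3 | 54 | 1 | 1 | 192 | 283 | 0 | 0 | 195 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 3 | 4 | 45 | 0 | 0 | 138 | 236 | 0 | 0 | 152 | 1 | 0.2 | 1 | 0 | 2 | 1 |
| 4 | 5 | 35 | 1 | 1 | 122 | 192 | 0 | 1 | 174 | 0 | 0.0 | 2 | 0 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 146 | 147 | 50 | 1 | 2 | 140 | 233 | 0 | 1 | 163 | 0 | 0.6 | 1 | 1 | 3 | 0 |
| 147 | 148 | 51 | 1 | 2 | 94 | 227 | 0 | 1 | 154 | 1 | 0.0 | 2 | 1 | 3 | 1 |
| 148 | 149 | 69 | 1 | 3 | 160 | 234 | 1 | 0 | 131 | 0 | 0.1 | 1 | 1 | 2 | 1 |
| 149 | 150 | 46 | 1 | 0 | 120 | 249 | 0 | 0 | 144 | 0 | 0.8 | 2 | 0 | 3 | 0 |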


<br></br>
## Missing Check

In [20]:
print(color.BOLD + color.BLUE + '> # of train Missing : \n' + color.END, missing_column_check(train, COL_TYPE), '\n')
print(color.BOLD + color.BLUE + '> # of test  Missing : \n' + color.END, missing_column_check(test , COL_TYPE), '\n')

> # of train Missing : 0
> # of test  Missing : 0


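##### There are no missing values.
##### Beyond that, check the values where Null was artificially encoded as another value.
<br></br>

##### A detailed missing-value analysis is carried out in the EDA section.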
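
The variable table states that a missing `ca` is encoded as 4 and a missing `thal` as 0, so an ordinary null check cannot detect them. A minimal sketch of counting such sentinel codes follows. `SENTINELS`, `count_sentinels` and the toy rows are illustrative assumptions rather than part of the notebook, and plain dictionaries stand in for the DataFrame.

```python
from collections import Counter

# Sentinel codes taken from the variable table. Values are strings to
# match the str dtype that COL_TYPE assigns to categorical columns.
SENTINELS = {'ca': '4', 'thal': '0'}

def count_sentinels(rows, sentinels=SENTINELS):
    """Count, per column, how many rows hold the null-encoding value."""
    counts = Counter()
    for row in rows:
        for col, code in sentinels.items():
            if row.get(col) == code:
                counts[col] += 1
    return {col: counts[col] for col in sentinels}

# Made-up rows for illustration only.
toy_rows = [
    {'ca': '0', 'thal': '2'},
    {'ca': '4', 'thal': '3'},
    {'ca': '1', 'thal': '0'},
    {'ca': '4', 'thal': '1'},
]
sentinel_counts = count_sentinels(toy_rows)
print(sentinel_counts)
```

On the actual DataFrame, the same count could be obtained with an expression such as `(train['ca'] == '4').sum()`.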

<br></br>
<br></br>
># **EDA**

In [33]:
# numeric variables, character variables
num_vari  = [key for key in COL_TYPE.keys() if COL_TYPE[key] in [int,float]]
char_vari = [key for key in COL_TYPE.keys() if COL_TYPE[key] in [str      ]]

num_df  = train[num_vari]
char_df = train[char_vari]

print('Total variables     :', len(train.columns))
print('Numeric variables   :', len(num_vari))
print('Character variables :', len(char_vari))

Total variables     : 15
Numeric variables   : 6
Character variables : 9


<br></br>
## Characteristic Variable

<br></br>
## Numeric Variable

<br></br>
### hist + kde plot (hue=target) : distribution of each numeric variable, overall and by target

In [None]:
if PLOT:

    # No hue
    print(color.BOLD + color.BLUE + '> No Group' + color.END)
    density_plot(train,
                 vars = [v for v in num_vari if v != 'id'],  # list keeps a stable order and is a valid indexer
                 binwidth_adj_ratio = 0.8)
    plt.show()

    # hue : target
    print(color.BOLD + color.BLUE + '> Group by Target' + color.END)
    density_plot(train,
                 vars = [v for v in num_vari if v != 'id'],  # list keeps a stable order and is a valid indexer
                 hue = 'target',
                 binwidth_adj_ratio = 0.8)
    plt.show()

<br></br>
### Pairplot : relationships between the numeric variables

In [None]:
if PLOT:

    pairplot_df = num_df.copy().drop(['id'],axis=1)

    sns.pairplot(pd.concat([pairplot_df, train.target], axis = 1),
                 corner=True, hue = 'target')
    plt.show()

<br></br>
## Numeric Variable * Characteristic Variable

<br></br>
### train

In [None]:
if PLOT:

    for _iter,_col in enumerate(sorted(char_vari)):
        plot_num(grp_var = _col, num_vari = num_vari, hue_var='target',
                 data = train,
                 title_text = str(_iter+1) + '. ' + _col)

<br></br>
### test

In [None]:
if PLOT:

    tmp_vari = list(set(char_vari) - set(['target']))

    for _iter,_col in enumerate(sorted(tmp_vari)):
        plot_num(grp_var = _col, num_vari = num_vari,
                 data = test, hue_var=None,
                 title_text = str(_iter+1) + '. ' + _col)

<br></br>
<br></br>
># **Segment** : split the data into segments and fit a separate model for each

<br></br>
## Row Counts for Each Combination

In [None]:
segment0 = pd.Series(['1' if ca=='0' else '0' for ca in train.ca])
segment1 = pd.Series(['1' if cp=='0' else '0' for cp in train.cp])
segment2 = train.sex
segment3 = train.exang
segment4 = pd.Series(['1' if slope=='2' else '0' for slope in train.slope])

def comb_seg(n_comb, head):

    import itertools

    # collect the row counts for every combination
    res = []
    combination = list(itertools.product([1, 0], repeat=5))
    for comb in combination:
        
        # keep only the combinations whose size matches n_comb
        if sum(comb)==n_comb:
            comb_number = np.where(np.array(comb)==1)[0].tolist()
            comb_seg_ele = ['segment' + str(c) for c in comb_number]
            comb_seg = eval('+'.join(comb_seg_ele))

            ct = np.array(list(cnt(comb_seg).values()))
            ct = ct.flatten()
            ct = np.unique(ct)
            
            res.append(ct)

    # The per-combination count arrays can differ in length, so concatenate
    # them into one flat array instead of building a (possibly ragged) 2-D array,
    # which recent NumPy versions reject
    res = np.concatenate(res)

    # unique, sort, head (np.unique already returns sorted values)
    res = np.unique(res)[:head]

    return(res)

print('  iter : min(1st, 2nd, 3rd, ...)')
print('-'*40)
for iter in range(1,6):
    print(f'     {iter} : {comb_seg(iter,head=7)}')

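##### From combinations of two variables onward, some segments contain 20 rows or fewer, which looks too small to model.
##### Using each variable as a segment on its own looks preferable.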
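
The check behind this conclusion can be sketched without `eval`: derive the five binary flags, then count rows per combination of flags. The flag names `is_ca`, `is_cp` and `slope2` are the ones used later in `seg_var`, but mapping them to the `segment0`, `segment1` and `segment4` rules is an assumption, and the rows below are made up.

```python
from collections import Counter
from itertools import combinations

def add_segment_flags(row):
    """Derive binary segment flags with the same rules as segment0..segment4."""
    return {
        'is_ca':  '1' if row['ca'] == '0' else '0',
        'is_cp':  '1' if row['cp'] == '0' else '0',
        'sex':    row['sex'],
        'exang':  row['exang'],
        'slope2': '1' if row['slope'] == '2' else '0',
    }

def min_count_per_combination(flag_rows, n_comb):
    """Smallest observed cell count for every n_comb-way combination of flags."""
    names = list(flag_rows[0])
    result = {}
    for combo in combinations(names, n_comb):
        cells = Counter(tuple(r[c] for c in combo) for r in flag_rows)
        result[combo] = min(cells.values())
    return result

# Made-up rows for illustration only.
toy = [
    {'ca': '0', 'cp': '0', 'sex': '1', 'exang': '0', 'slope': '2'},
    {'ca': '1', 'cp': '2', 'sex': '1', 'exang': '1', 'slope': '1'},
    {'ca': '0', 'cp': '0', 'sex': '0', 'exang': '0', 'slope': '2'},
    {'ca': '4', 'cp': '3', 'sex': '1', 'exang': '1', 'slope': '0'},
]
flags = [add_segment_flags(r) for r in toy]
single = min_count_per_combination(flags, 1)
pairs = min_count_per_combination(flags, 2)
print(single)
print(pairs)
```

As in `comb_seg`, only cells that actually occur are counted, so a combination that never appears in the data does not show up as a zero.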

<br></br>
># **Preprocessing**

<br></br>
## Merge Sparse Categories & Create Derived Features

In [None]:
def preprocessing(_df):
    df = _df.copy()
    
    #------------------------------------------------------------#
    # 1. 
    #------------------------------------------------------------#
    
    return df

train2 = preprocessing(train.copy())
test2  = preprocessing(test .copy())

# add the new variables to COL_TYPE
for str_var in []:
    COL_TYPE[str_var] = str

<br></br>
##### EDA on the added variables

In [None]:
for _iter,_col in enumerate(sorted([])):
    plot_num(grp_var = _col, num_vari = num_vari, hue_var='target',
             data = train2,
             title_text = str(_iter+1) + '. ' + _col)

<br></br>
## Interaction Terms

In [None]:
if INTERACTION is True:
    train3 = interaction_term(train2,num_vari)
    test3  = interaction_term(test2 ,num_vari)

    for int_var in train3.columns[[col.find('*')>0 for col in train3.columns]]:
        COL_TYPE[int_var] = int
else:
    train3 = train2.copy()
    test3  = test2 .copy()
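
`interaction_term` multiplies every pair of numeric variables, including each variable with itself, because the inner loop starts at `j = i`. The sketch below lists the column names that this produces. `interaction_names` is a hypothetical helper, and it sorts the variables first, which the original function does not do.

```python
from itertools import combinations_with_replacement

def interaction_names(num_vars):
    """Names of all pairwise products (squares included), excluding 'id'."""
    cols = sorted(v for v in num_vars if v != 'id')
    return [f'{a}*{b}' for a, b in combinations_with_replacement(cols, 2)]

names = interaction_names(['id', 'age', 'trestbps', 'chol', 'thalach', 'oldpeak'])
print(len(names))
print(names[:3])
```

Sorting is a design choice of this sketch: the original iterates over a `set`, whose order is not guaranteed to be the same across Python sessions, so a name such as `age*chol` could appear as `chol*age` in another run.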

<br></br>
## astype('category')

In [None]:
# # (1) onehot encoding
# def onehot_encoding(data, col_types):

#     raw_data = data.copy()
    
#     cols = list(set(data.columns) - set(['target']))

#     for col in cols:
#         if col_types[col]==str:

#             data = pd.concat([
#                 data.drop([col],axis=1).reset_index(drop=True),
#                 pd.get_dummies(data[col], prefix = col).reset_index(drop=True).apply(lambda x:x.astype(int))
#                 ],
#                 axis=1)
    
#     return(data)

# (2) convert every str column to int or category
def str_convert(data, col_types, convert = 'category'):  # convert : int or 'category'

    cols = list(set(data.columns) - set(['target']))

    for col in cols:
        if col_types[col]==str:
            data[col] = data[col].astype(convert)
    
    return(data)

train4 = str_convert(train3, col_types = rmkey(COL_TYPE,'target'), convert = 'category')
test4  = str_convert(test3 , col_types = rmkey(COL_TYPE,'target'), convert = 'category')

<br></br>
># **Category Level Check**

In [None]:
# count the number of levels per categorical column for a single dataset
def check_category(data,col_types, ret=['dict','list']):

    cols = list(set(data.columns) - set(['target']))

    if ret=='dict':
        len_cate = {}
    elif ret=='list':
        len_cate = []
    else:
        raise ValueError("ret must be 'dict' or 'list'")
        
    for col in cols:
        if col_types[col]==str:
            _len = len(data[col].value_counts().index)
            
            if ret=='dict':
                len_cate[col] = _len
            elif ret=='list':
                len_cate.append(_len)
            
    return(len_cate)

check_category(train4,COL_TYPE,ret='dict')

##### Every categorical variable in train has two or more levels, so nothing looks wrong.
<br></br>

In [None]:
# check whether the same variable has the same levels in both datasets
def check_category2(data1,data2,col_types):

    cols = list(set(data1.columns) - set(['target']))
    max_char_len = max([len(x) if col_types[x]==str else 0 for x in col_types.keys()])
    
    # print, per column, the number of positions where the sorted levels differ (0 = identical)
    for col in cols:
        if col_types[col]==str:
            # sort the levels as strings so both datasets are compared in the same order
            data1_cate = data1[col].value_counts().index.astype(str).sort_values().values
            data2_cate = data2[col].value_counts().index.astype(str).sort_values().values
            n_blank = (max_char_len-len(col))
            if len(data1_cate)==len(data2_cate):
                same_index =  data1_cate != data2_cate
                print(col, ' '*n_blank, ':', same_index.sum())
            else:
                print(col, ' '*n_blank, ': differ length')
            
print(color.BOLD + color.BLUE + '> Number of differing categories' + color.END)
check_category2(train4,test4,COL_TYPE)

##### Confirmed that the categories match.
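
An alternative way to express the same check is a set difference, which also reports which levels are missing on each side. This is a sketch with a hypothetical helper `level_mismatches` and made-up level lists, not the notebook's own function.

```python
def level_mismatches(train_values, test_values):
    """Return (levels only in train, levels only in test), each sorted."""
    train_levels, test_levels = set(train_values), set(test_values)
    return sorted(train_levels - test_levels), sorted(test_levels - train_levels)

# Made-up values: level '3' appears only in train, level '4' only in test.
only_train, only_test = level_mismatches(['0', '1', '2', '3'], ['0', '1', '2', '4'])
print(only_train, only_test)
```

Unlike an element-by-element comparison of sorted arrays, this does not require both datasets to have the same number of levels.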
<br></br>

<br></br>
<br></br>
># **Scaling**

In [None]:
scale_var = list(set([col for col in COL_TYPE.keys() if COL_TYPE[col]!=str]) - set(['id']))
# len(scale_var),len(np.unique(scale_var)[0])

# all values are non-negative
# some min/max values in test do fall outside the train bounds..
print(color.BOLD + color.BLUE + '> train' + color.END)
for var in scale_var:
    print(minmax(train4[var]))

print('\n' + color.BOLD + color.BLUE + '> test ' + color.END)
for var in scale_var:
    print(minmax(test4 [var]))

In [None]:
if SCALE:

    # Normalization
    scaler = MinMaxScaler()
    scaler.fit(train4[scale_var])
    # print(scaler.n_samples_seen_, scaler.data_min_, scaler.data_max_, scaler.feature_range)

    train5 = train4.copy()
    test5  = test4 .copy()

    train5[scale_var] = scaler.transform(train5[scale_var])
    test5 [scale_var] = scaler.transform(test5 [scale_var])
    
    print(color.BOLD + color.BLUE + '> Min Max Scaling Ratio' + color.END)
    print(test5[scale_var].apply(lambda x: minmax(x)))
    
else:
    train5 = train4.copy()
    test5  = test4. copy()
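
Because the scaler is fitted on train only, any test value outside the train range is mapped outside [0, 1], which is what the comment about test values leaving the bounds refers to. The sketch below reproduces the min-max formula in plain Python. `minmax_scale` and the sample values are illustrative assumptions, and the `clip` flag only mimics the idea behind scikit-learn's `MinMaxScaler(clip=True)`.

```python
def minmax_scale(values, lo, hi, clip=False):
    """Scale values with (v - lo) / (hi - lo); optionally clip to [0, 1]."""
    scaled = [(v - lo) / (hi - lo) for v in values]
    if clip:
        scaled = [min(1.0, max(0.0, s)) for s in scaled]
    return scaled

# Made-up values: test contains a value below and a value above the train range.
train_chol = [200, 250, 300]
test_chol = [180, 260, 320]
lo, hi = min(train_chol), max(train_chol)

scaled = minmax_scale(test_chol, lo, hi)
clipped = minmax_scale(test_chol, lo, hi, clip=True)
print(scaled)
print(clipped)
```

Whether clipping is desirable depends on the model; tree-based models such as LightGBM are generally insensitive to this kind of monotone rescaling.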

<br></br>
<br></br>
># **Modelling**

<br></br>
## LGBM setting with Bayesian Optimization

In [None]:
# boundaries of the hyperparameters used in bayesian optimization
bounds_LGB = {
    'num_leaves': (100, 800), 
    'min_data_in_leaf': (0, 150),
    'bagging_fraction' : (0.3, 0.9),
    'feature_fraction' : (0.3, 0.9),
    'min_child_weight': (0.01, 1.),   
    'reg_alpha': (0.01, 1.), 
    'reg_lambda': (0.01, 1),
    'max_depth':(6, 23),
    'learning_rate': (1e-8, 0.05)
}

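# lightgbm modelling with hyperparameters selected
# through bayesian optimization
# NOTE: INIT_POINTS, N_ITER, N_CV, METRIC, N_ESTIMATORS and EARLY_STOPPING_ROUNDS
#       are not set in the cells above and must be defined before this function is defined and called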
def build_lgb(x, y, val_x, val_y,
              init_points=INIT_POINTS, n_iter=N_ITER, cv=N_CV, 
              ret_param=True, verbose=-1, is_test=False, 
              SEED=SEED):
    
    # verbose : 2 = always print, 1 = print only when a new maximum is found, 0 = print nothing
    
    # (1) return the f1 score of the lgb model for a given set of hyperparameters
    def LGB_bayesian(
        num_leaves, 
        bagging_fraction,
        feature_fraction,
        min_child_weight, 
        min_data_in_leaf,
        max_depth,
        reg_alpha,
        reg_lambda,
        learning_rate,
         ):
        # LightGBM expects the next three parameters to be integers.
        num_leaves = int(num_leaves)
        min_data_in_leaf = int(min_data_in_leaf)
        max_depth = int(max_depth)

        assert type(num_leaves) == int
        assert type(min_data_in_leaf) == int
        assert type(max_depth) == int

        params = {
            'num_leaves': num_leaves, 
            'min_data_in_leaf': min_data_in_leaf,
            'min_child_weight': min_child_weight,
            'bagging_fraction' : bagging_fraction,
            'feature_fraction' : feature_fraction,
            'learning_rate' : learning_rate,
            'max_depth': max_depth,
            'reg_alpha': reg_alpha,
            'reg_lambda': reg_lambda,
            'objective': 'binary',
            'save_binary': True,
            'seed': SEED,
            'feature_fraction_seed': SEED,
            'bagging_seed': SEED,
            'drop_seed': SEED,
            'data_random_seed': SEED,
            'boosting': 'gbdt', 
            'verbose': -1,
            'boost_from_average': True,
            'metric':METRIC,
            'n_estimators': N_ESTIMATORS, # 1000
            'n_jobs': -1,
        }    

        ## set reg options
        model = lgb.LGBMClassifier(**params)
        model.fit(x, y, eval_set=(val_x, val_y), early_stopping_rounds=30, verbose=-1)
        pred = model.predict(val_x)
        score = f1_score(val_y, pred)
        return score
    
    # (2) Get hyperparameters by bayesian optimization
    optimizer = BayesianOptimization(LGB_bayesian, bounds_LGB, random_state=SEED, verbose=-1)
    
    # run bayesian optimization, maximizing over init_points initial points and n_iter iterations
    optimizer.maximize(init_points=init_points, n_iter=n_iter, 
                       acq='ei', xi=0.01)
    # init_points is the number of initial exploration steps:
    # that many inputs are sampled within the ranges set in pbounds and evaluated
    # n_iter is the number of optimization iterations, so a total of 25 evaluations are performed
    # xi controls the exploration-exploitation trade-off; it is commonly set to 0.01 to encourage exploration
    
    
    # (3) hyperparameters obtained through bayesian optimization
    param_lgb = {
        'min_data_in_leaf': int(optimizer.max['params']['min_data_in_leaf']), 
        'num_leaves': int(optimizer.max['params']['num_leaves']), 
        'learning_rate': optimizer.max['params']['learning_rate'],
        'min_child_weight': optimizer.max['params']['min_child_weight'],
        'bagging_fraction': optimizer.max['params']['bagging_fraction'], 
        'feature_fraction': optimizer.max['params']['feature_fraction'],
        'reg_lambda': optimizer.max['params']['reg_lambda'],
        'reg_alpha': optimizer.max['params']['reg_alpha'],
        'max_depth': int(optimizer.max['params']['max_depth']), 
        'objective': 'binary',
        'save_binary': True,
        'seed': SEED,
        'feature_fraction_seed': SEED,
        'bagging_seed': SEED,
        'drop_seed': SEED,
        'data_random_seed': SEED,
        'boosting': 'gbdt', 
        'verbose': -1,
        'boost_from_average': True,
        'metric': METRIC, #'auc',
        'n_estimators': N_ESTIMATORS, # 1000
        'n_jobs': -1,
    }

    # final parameter
    params = param_lgb.copy()
    
    # final model
    model = lgb.LGBMClassifier(**params)
    # early stopping is configured once, through the callback API
    model.fit(x, y, eval_set=[(val_x, val_y)],
              callbacks = [lgb.early_stopping(EARLY_STOPPING_ROUNDS, verbose=False), lgb.log_evaluation(period=-1)])
    
    if ret_param:
        return model, params
    else:
        return model
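
The `xi` argument passed to `optimizer.maximize(acq='ei', xi=0.01)` above can be illustrated with the textbook expected-improvement formula. The sketch below is a minimal, standard-library version of that formula, assuming it matches what `acq='ei'` computes; `expected_improvement` and all numbers in it are hypothetical and only serve to show how a larger `xi` shifts the choice toward uncertain candidates.

```python
import math

def expected_improvement(mu, sigma, best, xi=0.01):
    """Expected improvement (maximization) at a point with posterior mean mu and std sigma."""
    gain = mu - best - xi              # improvement over the best score so far, shifted by xi
    if sigma <= 0.0:
        return max(gain, 0.0)          # no uncertainty: only a sure gain counts
    z = gain / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return gain * cdf + sigma * pdf

best_f1 = 0.90                        # hypothetical best validation F1 so far
safe = dict(mu=0.92, sigma=0.005)     # confident candidate, slightly better mean
risky = dict(mu=0.88, sigma=0.05)     # uncertain candidate, lower mean

for xi in (0.0, 0.03):
    ei_safe = expected_improvement(best=best_f1, xi=xi, **safe)
    ei_risky = expected_improvement(best=best_f1, xi=xi, **risky)
    choice = 'safe' if ei_safe > ei_risky else 'risky'
    print(f'xi={xi:.2f}  EI(safe)={ei_safe:.5f}  EI(risky)={ei_risky:.5f}  -> {choice}')
```

With `xi=0` the confident candidate has the larger expected improvement; with `xi=0.03` the uncertain one does, which is the sense in which a larger `xi` favors exploration.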

<br></br>
##### segment and dataset setting

In [None]:
# ca==cp > exang > slope > sex
# > ca has too few rows for values 2 and 3
# > cp has too few rows for values 1 and 3
seg_var = ['is_ca','is_cp','sex','exang','slope2']

# minimum row count among the values of each segment variable
[train5[seg_var_x].value_counts().min() for seg_var_x in seg_var]
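
The check above matters because one model is fitted per segment value, so a value with very few rows leaves little data for cross-validation. A minimal sketch of the same check without pandas, on made-up toy rows (the rows, `min_segment_count`, and the `MIN_ROWS` threshold are all hypothetical):

```python
from collections import Counter

# Toy rows standing in for train5; the values are invented for illustration.
rows = [dict(sex=s, exang=e, slope2=p)
        for s, e, p in zip([0, 0, 0, 1, 1, 1],
                           [0, 0, 0, 0, 1, 1],
                           [1, 1, 1, 1, 1, 0])]

def min_segment_count(rows, column):
    """Smallest number of rows among the distinct values of one column."""
    counts = Counter(row[column] for row in rows)
    return min(counts.values())

MIN_ROWS = 2  # hypothetical threshold below which a segment is considered too small
min_counts = {col: min_segment_count(rows, col) for col in ('sex', 'exang', 'slope2')}
too_small = [col for col, n in min_counts.items() if n < MIN_ROWS]

print(min_counts)
print('too small:', too_small)
```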

<br></br>
##### Check initial values

In [None]:
print('-'*50)
print('   Initial Values')
print('-'*50)
max_char_len = max([len(var) for var in ini_var])
for var in ini_var:
    char_len = ' '*(max_char_len-len(var))
    print(f'\t{var} {char_len} : {eval(var)}')
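
The cell above looks up each name in `ini_var` with `eval`. If the settings were kept in a dictionary instead, the same aligned listing could be produced without `eval`. This is only a sketch of that alternative: `format_settings` and the `settings` dictionary are hypothetical, and the values are copied from the run log further down.

```python
def format_settings(settings):
    """Return one 'NAME : value' line per setting, with names padded to equal width."""
    width = max(len(name) for name in settings)
    return [f'{name.ljust(width)} : {value}' for name, value in settings.items()]

# Hypothetical container for the notebook's global settings.
settings = {'SEED': 777, 'INIT_POINTS': 15, 'N_ITER': 15, 'METRIC': 'binary_logloss'}

lines = format_settings(settings)
print('-' * 50)
print('   Initial Values')
print('-' * 50)
for line in lines:
    print('\t' + line)
```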

<br></br>
## LGBM fitting for each segment category

In [None]:
start_time = time.time()

train_df = train5.copy()
test_df  = test5 .copy()

for seg_var_x in seg_var:

    train_df[seg_var_x+'_pred']   = np.nan
    test_df [seg_var_x+'_pred']   = np.nan

    train_df[seg_var_x+'_tr_idx'] = np.nan

    # fit one model per value of the segment variable
    for seg_var_value in train_df[seg_var_x].value_counts().index:

        # data setting
        if seg_var_x is not None:
            tr_seg_df = train_df[train_df[seg_var_x] == seg_var_value]
            te_seg_df = test_df [test_df [seg_var_x] == seg_var_value]

            drop_var = ['id','target'] +\
            [seg_var_x           for seg_var_x in seg_var] +\
            [seg_var_x+'_pred'   for seg_var_x in seg_var] +\
            [seg_var_x+'_tr_idx' for seg_var_x in seg_var]
            
            X_train = tr_seg_df[list(set(tr_seg_df.columns)-set(drop_var))]
            X_test  = te_seg_df[list(set(te_seg_df.columns)-set(drop_var))]

            y_train = tr_seg_df['target'][tr_seg_df[seg_var_x] == seg_var_value].astype(int).values

        else:
            tr_seg_df = train_df
            te_seg_df = test_df

            X_train = tr_seg_df.drop(['id','target'],axis=1)
            X_test  = te_seg_df.drop(['id'         ],axis=1)

            y_train = tr_seg_df['target'].astype(int).values

            
        # feature importance from a default model, used to drop low-importance features
        model = lgb.LGBMClassifier(seed = SEED)
        model.fit(X_train, y_train)

        feature_imp = pd.DataFrame(zip(X_train.columns,
                                       model.feature_importances_.astype(float)), 
                                   columns=['feature','imp']).sort_values(by='imp')
            
        reduced_var = list(feature_imp.feature[feature_imp.imp>1])

        X_train_new = X_train[reduced_var]
        X_test_new  = X_test [reduced_var]
        
        
        # modelling
        n_fold = 5
        sf = StratifiedKFold(n_fold, shuffle=True, random_state=SEED)

        y_tr = []
        y_te = []

        c = 1
        for tr_idx, val_idx in sf.split(X_train_new, y_train):
            print(len(tr_idx), len(val_idx))
            print('#'*25, f'CV {c}')

            model, _ = build_lgb(X_train_new.iloc[tr_idx ], y_train[tr_idx ], 
                                 X_train_new.iloc[val_idx], y_train[val_idx],
                                 init_points=INIT_POINTS, n_iter=N_ITER, cv=N_CV, 
                                 ret_param=True, is_test=False, 
                                 SEED=SEED)

            y_tr_0 = model.predict(X_train_new)
            y_te_0 = model.predict(X_test_new)

            y_tr.append(y_tr_0)
            y_te.append(y_te_0)

            c += 1

        # write each segment's predictions back (majority vote over the CV fold models)
        train_df.loc[train_df[seg_var_x] == seg_var_value, seg_var_x+'_pred'] = np.where(np.mean(y_tr, 0)>0.5, 1, 0)
        test_df .loc[test_df [seg_var_x] == seg_var_value, seg_var_x+'_pred'] = np.where(np.mean(y_te, 0)>0.5, 1, 0)
        
end_time = time.time()
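
Because `model.predict` returns 0/1 labels, `np.where(np.mean(y_tr, 0)>0.5, 1, 0)` in the loop above amounts to a majority vote over the fold models. A plain-Python sketch of that step, with made-up fold predictions (`average_and_threshold` is a hypothetical helper, not part of the notebook):

```python
def average_and_threshold(fold_preds, threshold=0.5):
    """Average 0/1 predictions across folds per row, then label 1 if the mean exceeds threshold."""
    n_folds = len(fold_preds)
    return [1 if sum(row_votes) / n_folds > threshold else 0
            for row_votes in zip(*fold_preds)]

# One list per fold model, one entry per row (invented values).
fold_preds = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
]

final_pred = average_and_threshold(fold_preds)
print(final_pred)
```

With an odd number of folds a tie cannot occur; with an even number, the strict `> 0.5` comparison would send ties to class 0.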

In [None]:
runtime = (end_time-start_time)/60
f'{runtime:.2f} Mins'

<br></br>
##### Confusion matrix and f1_score per segment variable

In [None]:
for seg_var_x in seg_var:
    
    tr_pred = train_df[f'{seg_var_x}_pred']
    tr_true = train_df.target.astype(int).values
    tr_f1   = f1_score(tr_true,tr_pred)  # sklearn signature: f1_score(y_true, y_pred)
    
    print('-'*50)
    print(f'{seg_var_x} - f1_score : {tr_f1:.2f}')
    print(pd.crosstab(tr_pred,tr_true))
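
For reference, the binary F1 score can be read directly off the confusion-matrix counts as 2·TP / (2·TP + FP + FN). The sketch below uses made-up labels and hypothetical helper names; it also shows that swapping the two arguments exchanges FP and FN but leaves F1 unchanged, although precision and recall trade places.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # invented labels
y_pred = [1, 0, 1, 0, 1, 0, 1, 1]   # invented predictions

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
f1 = f1_from_counts(tp, fp, fn)

# Swapping the arguments exchanges FP and FN, so F1 stays the same.
tp_s, fp_s, fn_s, _ = confusion_counts(y_pred, y_true)
f1_swapped = f1_from_counts(tp_s, fp_s, fn_s)

print(f'tp={tp} fp={fp} fn={fn} tn={tn}  f1={f1:.3f}  f1_swapped={f1_swapped:.3f}')
```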

<br></br>
##### Save train_df

In [None]:
# train_df.to_csv(OUT_PATH + 'train_df(2).csv', index=False)

In [None]:
train_df

<br></br>
##### f1_score for each segment variable

In [None]:
f1_score_list = []
for seg_var_x in seg_var:
    f1 = f1_score(train_df.target.astype(int).values, train_df[seg_var_x + '_pred'].astype(int).values)
    n_blank = ' '*(max_char_len-len(seg_var_x))
    
    if seg_var_x==seg_var[0]: print(' '*max_char_len, 'f1_score')
    print(f'{seg_var_x} {n_blank} {f1: .3f}')
    
    f1_score_list.append(f1)

<br></br>
##### Weight of each segment variable

In [None]:
weight = [f/sum(f1_score_list) for f in f1_score_list]
for seg,w in zip(seg_var,weight):
    n_blank = ' '*(max_char_len-len(seg))
    if seg==seg_var[0]: print(' '*max_char_len, '  weight')
    print(f'{seg} {n_blank} {w*100: .1f}%')

<br></br>
##### Weighted predicted value: identical to the unweighted result because the inputs are combinations of 0 and 1

In [None]:
# (1) mixed
tr_pred_mix = train_df[[seg + '_pred' for seg in seg_var]].sum(axis=1)/len(seg_var)
tr_f1_score_mix = f1_score(np.where(tr_pred_mix>0.5,1,0), train_df.target.astype(int).values)

# (2) weighted mixed
tr_pred_weighted_mix = np.array([train_df[seg+'_pred'].astype(int).values*weight[i] 
                                 for i,seg in enumerate(seg_var)]).sum(axis=0)
tr_f1_score_weighted_mix = f1_score(np.where(tr_pred_weighted_mix>0.5,1,0), train_df.target.astype(int).values)

print(f'f1_score of          mixed segment predicted value:{tr_f1_score_mix         : .3f}')
print(f'f1_score of weighted mixed segment predicted value:{tr_f1_score_weighted_mix: .3f}')

# a=pd.DataFrame({'x' : np.where(tr_pred_mix>0.5,1,0),
#                 'y' : np.where(tr_pred_weighted_mix>0.5,1,0)})
# pd.crosstab(a.x,a.y)
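
Why the weighted and unweighted mixes agree can be checked by brute force: with five 0/1 voters and nearly equal weights, any three weights sum to more than 0.5 and any two to less, so the weighted vote reduces to a simple majority. The sketch below enumerates all 32 vote patterns. The F1 scores in it are hypothetical, not the notebook's results, and the helper names are invented for illustration.

```python
from itertools import product

def simple_vote(votes):
    """Unweighted majority vote over 0/1 predictions."""
    return int(sum(votes) / len(votes) > 0.5)

def weighted_vote(votes, weights):
    """Weighted vote; the weights are assumed to sum to 1."""
    return int(sum(v * w for v, w in zip(votes, weights)) > 0.5)

def disagreements(weights):
    """All 0/1 vote patterns on which the two voting rules differ."""
    return [votes for votes in product([0, 1], repeat=len(weights))
            if simple_vote(votes) != weighted_vote(votes, weights)]

f1_scores = [0.90, 0.88, 0.86, 0.85, 0.87]           # hypothetical per-segment F1 scores
weights = [f / sum(f1_scores) for f in f1_scores]    # same normalization as in the notebook

print('near-uniform weights:', len(disagreements(weights)), 'disagreements')

# A strongly skewed weighting (also hypothetical) can overrule the majority.
skewed = [0.6, 0.1, 0.1, 0.1, 0.1]
print('skewed weights      :', len(disagreements(skewed)), 'disagreements')
```

The weighting only changes the outcome when the weights are unequal enough for a minority of models to exceed 0.5 in total.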

<br></br>
##### Build the final prediction by combining the segment models

In [None]:
train_df['tr_pred_mix'] = np.where(train_df[[seg + '_pred' for seg in seg_var]].sum(axis=1)/len(seg_var)>0.5, '1', '0')
test_df ['te_pred_mix'] = np.where(test_df [[seg + '_pred' for seg in seg_var]].sum(axis=1)/len(seg_var)>0.5, '1', '0')

<br></br>
##### confusion matrix and f1_score

In [None]:
f1 = f1_score(train_df.target.astype(int).values, train_df['tr_pred_mix'].astype(int).values)
print(f'f1_score : {f1:.3f}\n')
print(pd.crosstab(train_df['tr_pred_mix'],train_df.target))

' / '.join([f'{var}:{eval(var)}' for var in ini_var])


# f1 : 0.899 ('SEED:777 / INIT_POINTS:15 / N_ITER:15 / N_CV:4 / EARLY_STOPPING_ROUNDS:30 / N_ESTIMATORS:2000 / METRIC:binary_logloss / PLOT:False / SCALE:True / INTERACTION:True')
# f1 : 0.859 ('SEED:777 / INIT_POINTS:15 / N_ITER:15 / N_CV:4 / EARLY_STOPPING_ROUNDS:30 / N_ESTIMATORS:2000 / METRIC:binary_logloss / PLOT:False / SCALE:True / INTERACTION:False')

# f1 : 0.897 ('SEED:777 / INIT_POINTS:15 / N_ITER:15 / N_CV:4 / EARLY_STOPPING_ROUNDS:30 / N_ESTIMATORS:2000 / METRIC:binary_logloss / PLOT:False / SCALE:False / INTERACTION:True')
# f1 : 0.877 ('SEED:777 / INIT_POINTS:15 / N_ITER:15 / N_CV:4 / EARLY_STOPPING_ROUNDS:30 / N_ESTIMATORS:2000 / METRIC:binary_logloss / PLOT:False / SCALE:False / INTERACTION:False')

# f1 : 0.920 ('SEED:777 / INIT_POINTS:45 / N_ITER:100 / N_CV:4 / EARLY_STOPPING_ROUNDS:30 / N_ESTIMATORS:2000 / METRIC:binary_logloss / PLOT:False / SCALE:True / INTERACTION:True')
# f1 : 0.860 ('SEED:777 / INIT_POINTS:50 / N_ITER:200 / N_CV:4 / EARLY_STOPPING_ROUNDS:30 / N_ESTIMATORS:2000 / METRIC:auc / PLOT:False / SCALE:True / INTERACTION:True')

# f1 : 0.907 ('SEED:777 / INIT_POINTS:50 / N_ITER:300 / N_CV:4 / EARLY_STOPPING_ROUNDS:30 / N_ESTIMATORS:2000 / METRIC:binary_logloss / PLOT:False / SCALE:True / INTERACTION:True')

In [None]:
test_df['te_pred_mix'].value_counts()

In [None]:
sub['target'] = ['1' if pred=='1' else '0' for pred in test_df['te_pred_mix']]
sub.target.value_counts()

In [None]:
# score : 0.92537
# sub.to_csv(OUT_PATH + 'sample_18.csv', index=False)

In [None]:
a=pd.read_csv(OUT_PATH + '★sample_9_mixed.csv')
b=sub

print(pd.crosstab(a.target,b.target))
print(f1_score(a.target.astype(int).values,b.target.astype(int).values))
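
The comparison above treats an earlier submission as the reference and measures how far the new one deviates from it. A small standard-library sketch of the same idea, on invented label lists (`crosstab_counts` and `agreement_rate` are hypothetical helpers):

```python
from collections import Counter

def crosstab_counts(a, b):
    """Count how often each (label in a, label in b) pair occurs."""
    return Counter(zip(a, b))

def agreement_rate(a, b):
    """Fraction of rows on which the two submissions give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

old_sub = ['1', '0', '1', '1', '0', '0']   # invented earlier submission
new_sub = ['1', '0', '0', '1', '0', '1']   # invented current submission

table = crosstab_counts(old_sub, new_sub)
rate = agreement_rate(old_sub, new_sub)

for (old_label, new_label), n in sorted(table.items()):
    print(f'old={old_label} new={new_label} : {n}')
print(f'agreement : {rate:.3f}')
```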

