Problem Solving Process
================
-----------------------------------

#### 0. Understand the problem and the data
#### 1. Load Data
#### 2. Checking the Data
[2.1 Basic Data Checking](#2.1-Basic-Data-Checking)  
[2.2 Checking Missing Data](#2.2-Checking-Missing-Data)   
[2.3 Checking Target Label](#2.3-Checking-Target-Label)    
[2.4 Checking Data Types](#2.4-Checking-Data-Types)  
[2.5 Checking for Outliers](#2.5-Checking-for-Outliers)  
#### 3. Exploratory Data Analysis
[3.1 Checking relation between all the features and target](#3.1-Checking-relation-between-all-the-features-and-target)   
[3.2 Visualize the relation between all the features and target](#3.2-Visualize-the-relation-between-all-the-features-and-target)  
[3.3 Visualize relationship between 2 features](#3.3-Visualize-relationship-between-2-features)  
[3.4 Combining features for analysis](#3.4-Combining-features-for-analysis)  
[3.5 Applying a log transform to skewed data](#3.5-Applying-a-log-transform-to-skewed-data)  
[3.6 Checking correlations](#3.6-Checking-correlations)  
#### 4. Feature Engineering
[4.1 Fill Null values](#4.1-Fill-Null-values)  
[4.2 Converting continuous data into categorical data](#4.2-Converting-continuous-data-into-categorical-data)  
[4.3 Converting string data to numeric data](#4.3-Converting-string-data-to-numeric-data)  
[4.4 Visualizing with a heatmap](#4.4-Visualizing-with-a-heatmap)  
[4.5 Polynomial Features](#4.5-Polynomial-Features)  
[4.6 Domain Knowledge Features](#4.6-Domain-Knowledge-Features)  
[4.7 One-Hot Encoding](#4.7-One-Hot-Encoding)  
[4.8 Aligning Training and Testing Data](#4.8-Aligning-Training-and-Testing-Data)  
[4.9 Drop Columns](#4.9-Drop-Columns)  
[4.10 Feature Scaling](#4.10-Feature-Scaling)  
#### 5. Model Selection
[5.1 Splitting into train, validation, and test sets](#5.1-Splitting-into-train,-validation,-and-test-sets)  
[5.2 Model Lookthrough](#5.2-Model-Lookthrough)  
[5.3 Hyperparameter Tuning](#5.3-Hyperparameter-Tuning)  
[5.4 Plot Learning Curve](#5.4-Plot-Learning-Curve)  
[5.5 Test on Engineered Features](#5.5-Test-on-Engineered-Features)  
#### 6. Evaluation and Application
[6.1 Evaluate Accuracy](#6.1-Evaluate-Accuracy)  
[6.2 Feature importance](#6.2-Feature-importance)  
[6.3 Submission](#6.3-Submission)  

--------------------------------------

## 0. Understand the Problem and the Data

#### 0.1 Review the data files and column contents
- Check the column descriptions
#### 0.2 Check the evaluation metric
- Check which evaluation metric (ROC AUC, F1, etc.) the competition uses
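
As a rough illustration of why the metric matters, the sketch below scores the same predictions with ROC AUC and with F1. The labels and probabilities are made-up toy values, and the 0.5 threshold is an assumption rather than something a competition prescribes.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Toy ground truth and predicted probabilities (made-up values for illustration)
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

# ROC AUC is computed from the ranking of the probabilities
auc = roc_auc_score(y_true, y_prob)

# F1 needs hard labels, so a threshold (0.5 here, an assumption) is applied first
y_pred = (y_prob >= 0.5).astype(int)
f1 = f1_score(y_true, y_pred)

print('ROC AUC: {:.2f}, F1: {:.2f}'.format(auc, f1))
```

Because ROC AUC consumes probabilities while F1 consumes thresholded labels, knowing the metric up front tells you whether to submit `predict_proba` or `predict` output.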

## 1. Load Data

In [None]:
import pandas as pd

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

## 2. Checking the Data

### 2.1 Basic Data Checking  
- Print the data files and column contents to check them  
- Load the train and test sets separately and check their shape and head  

In [None]:
train.head()
train.tail()
train.shape
train.info()
train.describe()

test.head()
test.tail()
test.shape
test.info()
test.describe()

rows = train.shape[0]
columns = train.shape[1]
print('The train dataset contains {} rows and {} columns'.format(rows, columns))

### 2.2 Checking Missing Data  
- Check both the train and test data
- Visualize with msno (missingno) -> (columns with too much missing data are not used!)
- Check the missing values
- The missing values must later be filled with an imputer.
  XGBoost, however, can handle missing values without imputation.
- Columns with too many missing values can be dropped.
- We do not yet know which columns are useful, so keep them all for now.

In [None]:
# Checking the share of missing values, method 1 (when there are few columns!)
for col in df_train.columns:
    msg = 'column : {:>10}\t Percent of Nan value : {:.2f}%'.format(col, 100 * (df_train[col].isnull().sum() / df_train[col].shape[0]))
    print(msg)
    
for col in df_test.columns:
    msg = 'column : {:>10}\t Percent of Nan value : {:.2f}%'.format(col, 100 * (df_test[col].isnull().sum() / df_test[col].shape[0]))
    print(msg)

In [None]:
# Checking the share of missing values, method 2 (when there are many columns!)
def missing_values_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 
                   1 : '% of Total Values'}
    )
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    
    print('Your selected dataframe has ' + str(df.shape[1]) + ' columns.\n'
         'There are ' + str(mis_val_table_ren_columns.shape[0]) +
         ' columns that have missing values.')
    
    return mis_val_table_ren_columns

missing_values = missing_values_table(app_train)
missing_values.head(20)

In [None]:
# Visualize missing values
msno.matrix(df=df_train.iloc[:, :], figsize=(8,8), color=(0.8, 0.5, 0.2))

msno.bar(df=df_train.iloc[:, :], figsize=(8, 8), color=(0.8, 0.5, 0.2))

msno.bar(df=df_test.iloc[:, :], figsize=(8, 8), color=(0.8, 0.5, 0.2))

### 2.3 Checking Target Label  
- Visually check how the binary (0/1) label is distributed  
- An extremely skewed distribution is not helpful..
- Check it both numerically and visually
- How to handle the class imbalance problem : http://www.chioka.in/class-imbalance-problem/

In [None]:
# Visualize it
fig, ax = plt.subplots(1,2, figsize = (18,8))

df_train['Survived'].value_counts().plot.pie(explode = [0, 0.1], 
                                             autopct='%1.1f%%',
                                            ax=ax[0],
                                            shadow=True)
ax[0].set_title('Pie plot - Survived')
ax[0].set_ylabel('')
sns.countplot(x='Survived', data=df_train, ax=ax[1])
ax[1].set_title('Count plot - Survived')

plt.show()

In [None]:
# Show it as counts
app_train['TARGET'].value_counts()
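
To put a number on how unbalanced the label is, a minimal sketch along the following lines may help. The target values are made up, and the weight formula is the common "balanced" heuristic (`n_samples / (n_classes * class_count)`), shown here as one option rather than the approach this notebook settles on.

```python
import pandas as pd

# Toy binary target (made-up values): 8 negatives, 2 positives
target = pd.Series([0] * 8 + [1] * 2, name='TARGET')

# Share of each class
ratios = target.value_counts(normalize=True)

# "Balanced" weights: n_samples / (n_classes * class_count)
counts = target.value_counts()
weights = len(target) / (len(counts) * counts)

print(ratios)
print(weights)
```

The rarer class receives the larger weight, which many estimators can consume through a `class_weight`-style argument.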

### 2.4 Checking Data Types 

- int64, float64 (numeric), object (string or categorical)
- Count the columns of each data type
- Check how many distinct values each object column has
- Build a metadata table

In [None]:
# Count the columns of each data type
app_train.dtypes.value_counts()

In [None]:
# Build the metadata table
data = []
for f in train.columns:
    # Defining the role
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'
         
    # Defining the level
    if 'bin' in f or f == 'target':
        level = 'binary'
    elif 'cat' in f or f == 'id':
        level = 'nominal'
    elif train[f].dtype == float:
        level = 'interval'
    elif train[f].dtype == int:
        level = 'ordinal'
        
    # Initialize keep to True for all variables except for id
    keep = True
    if f == 'id':
        keep = False
    
    # Defining the data type 
    dtype = train[f].dtype
    
    # Creating a Dict that contains all the metadata for the variable
    f_dict = {
        'varname': f,
        'role': role,
        'level': level,
        'keep': keep,
        'dtype': dtype
    }
    data.append(f_dict)
    
meta = pd.DataFrame(data, columns=['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace=True)
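
Once such a table exists, it can drive column selection. The sketch below uses a small hypothetical metadata frame (the variable names are invented for illustration) with the same columns as `meta` above.

```python
import pandas as pd

# Hypothetical metadata table with the same columns as `meta` above
meta_demo = pd.DataFrame(
    {
        'varname': ['id', 'target', 'ps_ind_01', 'ps_car_01_cat', 'ps_reg_01'],
        'role': ['id', 'target', 'input', 'input', 'input'],
        'level': ['nominal', 'binary', 'ordinal', 'nominal', 'interval'],
        'keep': [False, True, True, True, True],
    }
).set_index('varname')

# Names of the nominal variables that are kept
nominal_cols = meta_demo[(meta_demo.level == 'nominal') & (meta_demo.keep)].index.tolist()

# Number of variables per role and level
summary = meta_demo.groupby(['role', 'level']).size()

print(nominal_cols)
print(summary)
```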

### 2.5 Checking for Outliers
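- How to handle outliers depends on the situation, but a common approach is to mark them as missing values and impute them before applying machine learning. 
- Here, however, the DAYS_EMPLOYED column carries important information, so we fill it with a different value instead
- Apply the same treatment to the test set  
- Check how the outlier rows affect the target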

In [None]:
# Check each column for outliers
app_train['DAYS_EMPLOYED'].describe()

In [None]:
# Outlier detection
from collections import Counter
import numpy as np

def detect_outliers(df,n,features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []
    
    # iterate over features(columns)
    for col in features:
        # 1st quartile (25%), ignoring NaN values
        Q1 = np.nanpercentile(df[col], 25)
        # 3rd quartile (75%), ignoring NaN values
        Q3 = np.nanpercentile(df[col], 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        
        # outlier step
        outlier_step = 1.5 * IQR
        
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step )].index
        
        # append the found outlier indices for col to the list of outlier indices 
        outlier_indices.extend(outlier_list_col)
        
    # select observations containing more than n outliers
    outlier_indices = Counter(outlier_indices)        
    multiple_outliers = list( k for k, v in outlier_indices.items() if v > n )
    
    return multiple_outliers   

# detect outliers from Age, SibSp , Parch and Fare
Outliers_to_drop = detect_outliers(train,2,["Age","SibSp","Parch","Fare"])
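
The two ways of handling flagged rows mentioned above (dropping them, or marking the value as missing for later imputation) can be sketched on a single toy column. The values are made up, and the fences follow the same 1.5 * IQR Tukey rule as `detect_outliers`.

```python
import numpy as np
import pandas as pd

# Toy column (made-up values) with one obvious outlier
df_demo = pd.DataFrame({'Fare': [7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 500.0]})

q1, q3 = np.nanpercentile(df_demo['Fare'], [25, 75])
step = 1.5 * (q3 - q1)

# Rows outside the Tukey fences
is_outlier = (df_demo['Fare'] < q1 - step) | (df_demo['Fare'] > q3 + step)

# Option 1: drop the rows
dropped = df_demo[~is_outlier].reset_index(drop=True)

# Option 2: keep the rows but mark the value as missing for later imputation
masked = df_demo.copy()
masked.loc[is_outlier, 'Fare'] = np.nan
```

Option 2 keeps the other columns of the row available to the model, which matters when the dataset is small.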

## 3. Exploratory Data Analysis

### 3.1 Checking relation between all the features and target

In [None]:
df_train[['Pclass', 'Survived']].groupby(['Pclass'], as_index = True).count()
df_train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=True).sum()
pd.crosstab(df_train['Pclass'], df_train['Survived'], margins=True).style.background_gradient(cmap='summer_r')

### 3.2 Visualize the relation between all the features and target

***Use bar, pie, distplot, countplot, violinplot, kdeplot, etc.***

In [None]:
# countplot
y_position = 1.02
fig, ax = plt.subplots(1,2,figsize=(18,8))
df_train['Pclass'].value_counts().plot.bar(color=['#CD7F32','#FFDF00','#D3D3D3'], ax=ax[0])
ax[0].set_title('Number of Passengers By Pclass', y=y_position)
ax[0].set_ylabel('Count')
sns.countplot(x='Pclass', hue='Survived', data=df_train, ax=ax[1])
ax[1].set_title('Pclass : Survived vs Dead', y=y_position)
plt.show()

In [None]:
# kdeplot
fig, ax = plt.subplots(1, 1, figsize=(9, 5))
sns.kdeplot(df_train.dropna()[df_train.dropna()['Survived'] == 1]['Age'], ax=ax)
sns.kdeplot(df_train.dropna()[df_train.dropna()['Survived'] == 0]['Age'], ax=ax)
plt.legend(['Survived == 1', 'Survived == 0'])
plt.show()

In [None]:
# violinplot
f,ax=plt.subplots(1,2,figsize=(18,8))
sns.violinplot(x="Pclass", y="Age", hue="Survived", data=df_train, scale='count', split=True,ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))
sns.violinplot(x="Sex", y="Age", hue="Survived", data=df_train, scale='count', split=True,ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))
plt.show()

### 3.3 Visualize relationship between 2 features

In [None]:
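# factorplot 1
# factorplot was renamed to catplot in seaborn 0.9 (and size to height); kind='point' keeps the old default
sns.catplot(x='Pclass', y='Survived', hue='Sex', data=df_train, kind='point', height=6, aspect=1.5)

In [None]:
# factorplot 2
# saturation is dropped here because point plots do not take it
sns.catplot(x='Sex', y='Survived', col='Pclass', data=df_train, kind='point',
            height=9, aspect=1)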

### 3.4 Combining features for analysis  
#### (e.g. siblings + parents/children = family)  

In [None]:
df_train['FamilySize'] = df_train['SibSp'] + df_train['Parch'] + 1
df_test['FamilySize'] = df_test['SibSp'] + df_test['Parch'] + 1

print('Maximum size of Family :', df_train['FamilySize'].max())
print('Minimum size of Family :', df_train['FamilySize'].min())

### 3.5 Applying a log transform to skewed data

In [None]:
# distplot
fig, ax = plt.subplots(1,1,figsize=(8,8))
g = sns.distplot(df_train['Fare'], color='b', label='Skewness : {:.2f}'.format(df_train['Fare'].skew()), ax=ax)
g = g.legend(loc='best')

df_test.loc[df_test.Fare.isnull(), 'Fare'] = df_test['Fare'].mean()

df_train['Fare'] = df_train['Fare'].map(lambda i : np.log(i) if i > 0 else 0)
df_test['Fare'] = df_test['Fare'].map(lambda i : np.log(i) if i >0 else 0)

fig, ax = plt.subplots(1,1,figsize=(8,8))
g = sns.distplot(df_train['Fare'], color = 'b', label = 'Skewness : {:.2f}'.format(df_train['Fare'].skew()), ax=ax)
g = g.legend(loc='best')
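
The effect of the transform can also be checked numerically without plotting. The sketch below uses made-up fare values and swaps in `np.log1p` (log of 1 + x) as an alternative to the `np.log(i) if i > 0 else 0` lambda above; it is not the same function, only a common way to handle zeros without a special case.

```python
import numpy as np
import pandas as pd

# Toy right-skewed values (made-up), including a zero
fare = pd.Series([0.0, 7.25, 8.05, 13.0, 26.0, 71.3, 512.3])

skew_before = fare.skew()

# log1p computes log(1 + x), so zero maps to zero without a special case
fare_log = np.log1p(fare)
skew_after = fare_log.skew()

print('Skewness before: {:.2f}, after: {:.2f}'.format(skew_before, skew_after))
```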

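### 3.6 Checking correlations

- Understand the correlation between each variable and the target
- For reference, a common rule of thumb for the absolute value of the correlation coefficient:  
.00-.19 “very weak”  
.20-.39 “weak”  
.40-.59 “moderate”  
.60-.79 “strong”  
.80-1.0 “very strong”  
- The client's age has the largest positive correlation (but the age column is stored as negative values, so check the absolute value)
- Visualize the important columns
- A KDE plot makes it easier to see how a column affects the target. 
- For age, one approach is to split it into bins and check the repayment rate per bin. 
- A pairplot is useful for examining the relationships among several variables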

In [None]:
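# Sort the correlations with the target
# numeric_only=True (pandas >= 1.5) skips object columns, which newer pandas no longer drops silently
correlations = app_train.corr(numeric_only=True)['TARGET'].sort_values()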

print('Most Positive Correlations : \n', correlations.tail(15))
print('\nMost Negative Correlations : \n', correlations.head(15))

In [None]:
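# Visualize to understand the correlations
# EXT_SOURCE_3 shows the largest difference between the target values.
# The correlation is weak, but it still gives a machine learning model useful information.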
plt.figure(figsize = (10, 12))

for i, source in enumerate(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']):
    
    plt.subplot(3, 1, i + 1)
    # plot repaid loans
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, source].dropna(), label = 'target == 0')
    # plot loans that were not repaid
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, source].dropna(), label = 'target == 1')
    
    # Label the plots
    plt.title('Distribution of %s by Target Value' % source)
    plt.xlabel('%s' % source); plt.ylabel('Density');
    
plt.tight_layout(h_pad = 2.5)
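
The age-binning idea from the notes above can be sketched as follows. The data is made up, the column name `YEARS_BIRTH` is a hypothetical age-in-years column, and `TARGET == 1` is taken to mean "not repaid", as in the KDE plot comments.

```python
import numpy as np
import pandas as pd

# Toy data (made-up): age in years and a binary target (1 = not repaid)
age_demo = pd.DataFrame({
    'YEARS_BIRTH': [22, 27, 33, 38, 45, 52, 58, 64],
    'TARGET':      [1,  1,  0,  1,  0,  0,  0,  0],
})

# Cut age into 10-year bins: (20, 30], (30, 40], ...
age_demo['YEARS_BINNED'] = pd.cut(age_demo['YEARS_BIRTH'], bins=np.arange(20, 80, 10))

# Mean of the target per bin = share of loans not repaid in that age group
rate_by_bin = age_demo.groupby('YEARS_BINNED', observed=True)['TARGET'].mean()
print(rate_by_bin)
```

A bar chart of `rate_by_bin` then shows the trend across age groups more directly than the raw correlation coefficient.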

## 4. Feature Engineering
- Feature engineering includes both creating new features and selecting features.
- Good column to read : https://blog.featurelabs.com/secret-to-data-science-success/
- automated tools : https://docs.featuretools.com/getting_started/install.html
- Polynomial features : generate powers of the existing features and interaction terms that combine several individual variables.  
  Even if individual features have little effect, their combination can strongly affect the target.   
  Do not set the degree too high (it causes overfitting).

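### 4.1 Fill Null values  
- Fill missing data using other features  
(e.g. extract the initial (title) from the name and fill missing ages with the mean age for that initial)  
- Fill with the most frequent category  

***Filling null values using other features***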

In [None]:
# Extract the initial (title) from each name
df_train['Initial'] = df_train.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
df_test['Initial'] = df_test.Name.str.extract(r'([A-Za-z]+)\.', expand=False)

# Check the counts with a crosstab
pd.crosstab(df_train['Initial'], df_train['Sex']).T.style.background_gradient(cmap='summer_r')

# Merge rare initials into the main groups (assign back instead of a chained inplace=True)
df_train['Initial'] = df_train['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don', 'Dona'],
                        ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr', 'Mr'])

df_test['Initial'] = df_test['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don', 'Dona'],
                        ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr', 'Mr'])

# Check the mean values (including Age) for each initial
df_train.groupby('Initial').mean(numeric_only=True)

# Fill null values with the mean age for each initial
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Mr'),'Age'] = 33
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Mrs'),'Age'] = 36
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Master'),'Age'] = 5
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Miss'),'Age'] = 22
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Other'),'Age'] = 46

df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Mr'),'Age'] = 33
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Mrs'),'Age'] = 36
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Master'),'Age'] = 5
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Miss'),'Age'] = 22
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Other'),'Age'] = 46

In [None]:
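# Using an imputer (sklearn.preprocessing.Imputer was replaced by sklearn.impute.SimpleImputer)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')

# fit on the training data
imputer.fit(train)

# transform both training and testing data
train = imputer.transform(train)
test = imputer.transform(test)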

***Filling null values with the most frequent category***

In [None]:
print('Embarked has ', sum(df_train['Embarked'].isnull()), 'Null values')
df_train['Embarked'] = df_train['Embarked'].fillna('S')

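### 4.2 Converting continuous data into categorical data  
- Convert everything by hand, or  
- use a function (apply), or  
- use pandas: `cut` splits by the bin edges you pass, while `qcut` splits by quantiles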

In [None]:
# method1
df_train['Age_cat'] = 0
df_train.loc[df_train['Age'] < 10, 'Age_cat'] = 0
df_train.loc[(10 <= df_train['Age']) & (df_train['Age'] < 20), 'Age_cat'] = 1
df_train.loc[(20 <= df_train['Age']) & (df_train['Age'] < 30), 'Age_cat'] = 2
df_train.loc[(30 <= df_train['Age']) & (df_train['Age'] < 40), 'Age_cat'] = 3
df_train.loc[(40 <= df_train['Age']) & (df_train['Age'] < 50), 'Age_cat'] = 4
df_train.loc[(50 <= df_train['Age']) & (df_train['Age'] < 60), 'Age_cat'] = 5
df_train.loc[(60 <= df_train['Age']) & (df_train['Age'] < 70), 'Age_cat'] = 6
df_train.loc[70 <= df_train['Age'], 'Age_cat'] = 7

df_test['Age_cat'] = 0
df_test.loc[df_test['Age'] < 10, 'Age_cat'] = 0
df_test.loc[(10 <= df_test['Age']) & (df_test['Age'] < 20), 'Age_cat'] = 1
df_test.loc[(20 <= df_test['Age']) & (df_test['Age'] < 30), 'Age_cat'] = 2
df_test.loc[(30 <= df_test['Age']) & (df_test['Age'] < 40), 'Age_cat'] = 3
df_test.loc[(40 <= df_test['Age']) & (df_test['Age'] < 50), 'Age_cat'] = 4
df_test.loc[(50 <= df_test['Age']) & (df_test['Age'] < 60), 'Age_cat'] = 5
df_test.loc[(60 <= df_test['Age']) & (df_test['Age'] < 70), 'Age_cat'] = 6
df_test.loc[70 <= df_test['Age'], 'Age_cat'] = 7

# method 2
def category_age(x):
    if x < 10:
        return 0
    elif x < 20:
        return 1
    elif x < 30:
        return 2
    elif x < 40:
        return 3
    elif x < 50:
        return 4
    elif x < 60:
        return 5
    elif x < 70:
        return 6
    else:
        return 7
    
df_train['Age_cat_2'] = df_train['Age'].apply(category_age)

df_train.drop(['Age', 'Age_cat_2'], axis=1, inplace=True)
df_test.drop(['Age'], axis=1, inplace=True)
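
The pandas route from the list above can be sketched on toy ages. The bin edges below are chosen to mirror method 1 and method 2; `right=False` makes each bin closed on the left, and ages below 0 would fall outside the bins, which is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Toy ages (made-up values)
ages = pd.Series([5, 10, 19, 25, 39, 64, 70, 80])

# Same edges as the methods above: [0, 10), [10, 20), ..., [60, 70), [70, inf)
bins = [0, 10, 20, 30, 40, 50, 60, 70, np.inf]
age_cat = pd.cut(ages, bins=bins, right=False, labels=False)

# qcut instead picks the edges from quantiles, so each bin gets a similar count
age_quartile = pd.qcut(ages, q=4, labels=False)

print(age_cat.tolist())
print(age_quartile.tolist())
```

Fixed edges keep the categories interpretable (decades of age), while quantile edges keep the categories balanced in size.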

### 4.3 Converting string data to numeric data  
- Use map (for a Series)  
- apply also works on a DataFrame (row- or column-wise)!  

In [None]:
df_train['Initial'] = df_train['Initial'].map({'Master' : 0, 'Miss' : 1, 'Mr' : 2, 'Mrs' : 3, 'Other' : 4})
df_test['Initial'] = df_test['Initial'].map({'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4})

df_train['Embarked'].unique()

df_train['Embarked'].value_counts()

df_train['Embarked'] = df_train['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})
df_test['Embarked'] = df_test['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

df_train['Embarked'].isnull().any()

df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})
df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1})

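### 4.4 Visualizing with a heatmap  
- Identify the features most strongly correlated with the label to predict  
- Check whether any features are strongly correlated with each other  
(a correlation of 1 or -1 means the two features carry the same information!)  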

In [None]:
heatmap_data = df_train[['Survived', 'Pclass', 'Sex', 'Fare', 'Embarked', 'FamilySize', 'Initial', 'Age_cat']]

colormap = plt.cm.RdBu
plt.figure(figsize=(14,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(heatmap_data.astype(float).corr(), linewidth=0.1, vmax=1.0,
           square = True, cmap=colormap, linecolor='white', annot=True, annot_kws={'size' : 16})

del heatmap_data

### 4.5 Polynomial Features

- Extract the features, apply the imputer, remove the target label, create the polynomial features,  
  fit, transform, and check the correlations

In [None]:
# Extract the features to apply the polynomial transform to
poly_features = app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'TARGET']]
poly_features_test = app_test[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]

# imputer for handling missing values
# (sklearn.preprocessing.Imputer was replaced by sklearn.impute.SimpleImputer)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median')

# Remove the target value
poly_target = poly_features['TARGET']
poly_features = poly_features.drop(columns = ['TARGET'])

# Need to impute missing values
# see note 3.1
poly_features = imputer.fit_transform(poly_features)
poly_features_test = imputer.transform(poly_features_test)

from sklearn.preprocessing import PolynomialFeatures
                                  
# Create the polynomial object with specified degree
poly_transformer = PolynomialFeatures(degree = 3)

# Train the polynomial features
poly_transformer.fit(poly_features)

# Transform the features
poly_features = poly_transformer.transform(poly_features)
poly_features_test = poly_transformer.transform(poly_features_test)
print('Polynomial Features shape: ', poly_features.shape)

# Check the names of the newly created features
# (get_feature_names was replaced by get_feature_names_out in newer scikit-learn)
poly_transformer.get_feature_names_out(input_features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH'])[:15]

# Some of the new features correlate more strongly with the target than the originals
# Create a dataframe of the features 
poly_features = pd.DataFrame(poly_features, 
                             columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1', 'EXT_SOURCE_2', 
                                                                               'EXT_SOURCE_3', 'DAYS_BIRTH']))

# Add in the target
poly_features['TARGET'] = poly_target

# Find the correlations with the target
poly_corrs = poly_features.corr()['TARGET'].sort_values()

# Display most negative and most positive
print(poly_corrs.head(10), '\n')
print(poly_corrs.tail(5))

# Merge them into the train and test sets!
# Put test features into dataframe
poly_features_test = pd.DataFrame(poly_features_test, 
                                  columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1', 'EXT_SOURCE_2', 
                                                                                    'EXT_SOURCE_3', 'DAYS_BIRTH']))

# Merge polynomial features into training dataframe
poly_features['SK_ID_CURR'] = app_train['SK_ID_CURR']
app_train_poly = app_train.merge(poly_features, on = 'SK_ID_CURR', how = 'left')

# Merge polynomial features into testing dataframe
poly_features_test['SK_ID_CURR'] = app_test['SK_ID_CURR']
app_test_poly = app_test.merge(poly_features_test, on = 'SK_ID_CURR', how = 'left')

# Align the dataframes
app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join = 'inner', axis = 1)

# Print out the new shapes
print('Training data with polynomial features shape: ', app_train_poly.shape)
print('Testing data with polynomial features shape:  ', app_test_poly.shape)

### 4.6 Domain Knowledge Features

In [None]:
# Create the CREDIT_INCOME_PERCENT, ANNUITY_INCOME_PERCENT, CREDIT_TERM, DAYS_EMPLOYED_PERCENT features
app_train_domain = app_train.copy()
app_test_domain = app_test.copy()

app_train_domain['CREDIT_INCOME_PERCENT'] = app_train_domain['AMT_CREDIT'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['ANNUITY_INCOME_PERCENT'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['CREDIT_TERM'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_CREDIT']
app_train_domain['DAYS_EMPLOYED_PERCENT'] = app_train_domain['DAYS_EMPLOYED'] / app_train_domain['DAYS_BIRTH']

app_test_domain['CREDIT_INCOME_PERCENT'] = app_test_domain['AMT_CREDIT'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['ANNUITY_INCOME_PERCENT'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['CREDIT_TERM'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_CREDIT']
app_test_domain['DAYS_EMPLOYED_PERCENT'] = app_test_domain['DAYS_EMPLOYED'] / app_test_domain['DAYS_BIRTH']

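### 4.7 One-Hot Encoding  
- Transform categorical data with one-hot encoding (pd.get_dummies)   
- (sklearn LabelEncoder + OneHotEncoder also works!)  
- Label Encoding : the integer labels are assigned arbitrarily. This works well when there are only two categorical values (e.g. male/female), but with more categories the arbitrary order can mislead the model.
- One-Hot Encoding : represents categorical values well. The number of columns grows considerably, but this can be addressed with PCA.
- Here we use Label Encoding for columns with 2 categories and One-Hot Encoding for columns with more than 2 categories. 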

In [None]:
# Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le_count = 0


for col in app_train:
    if app_train[col].dtype == 'object':
        if len(list(app_train[col].unique())) <= 2:
            le.fit(app_train[col])
            
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])
            
            le_count += 1
            
print('%d columns were label encoded.' % le_count)

In [None]:
# One-Hot Encoding
df_train = pd.get_dummies(df_train, columns=['Initial'], prefix='Initial')
df_test = pd.get_dummies(df_test, columns=['Initial'], prefix='Initial')

df_train.head()

df_train = pd.get_dummies(df_train, columns=['Embarked'], prefix='Embarked')
df_test = pd.get_dummies(df_test, columns=['Embarked'], prefix='Embarked')

In [None]:
# Check the shapes after one-hot encoding
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

print('Training Features shape :', app_train.shape)
print('Testing Features shape :', app_test.shape)

### 4.8 Aligning Training and Testing Data
- The train and test sets must have the same features. One-hot encoding can leave them with different columns (a category may appear in only one of the sets), so we align them (axis=1 applies the alignment to the columns!).

In [None]:
# align training and testing data
train_labels = app_train['TARGET']

app_train, app_test = app_train.align(app_test, join = 'inner', axis=1)

app_train["TARGET"] = train_labels

print('Training Features shape :', app_train.shape)
print('Testing Features shape :', app_test.shape)
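
The effect of `align` is easiest to see on a toy pair of frames. In this sketch (made-up data, hypothetical column name `cat`) the category `D` exists only in the training frame, so its dummy column is dropped by the inner join.

```python
import pandas as pd

# Toy frames (made-up): 'D' appears only in train, so the dummies differ
train_demo = pd.DataFrame({'cat': ['A', 'B', 'D'], 'TARGET': [0, 1, 0]})
test_demo = pd.DataFrame({'cat': ['A', 'B', 'A']})

train_ohe = pd.get_dummies(train_demo)
test_ohe = pd.get_dummies(test_demo)

# Keep the target aside, because the inner join drops columns missing from test
labels_demo = train_ohe['TARGET']
train_ohe, test_ohe = train_ohe.align(test_ohe, join='inner', axis=1)
train_ohe['TARGET'] = labels_demo

print(list(train_ohe.columns))
print(list(test_ohe.columns))
```

Saving the target before aligning matters: it is absent from the test frame, so the inner join would otherwise discard it.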

### 4.9 Drop Columns  
- Drop unnecessary columns  
- Check that the columns match, except for the label to predict  

In [None]:
df_train.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1, inplace=True)
df_test.drop(['PassengerId', 'Name',  'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1, inplace=True)

df_train.head()
df_test.head()

### 4.10 Feature Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range = (0, 1))
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)
print('Training data shape :', train.shape)
print('Testing data shape :', test.shape)

## 5. Model Selection

### 5.1 Splitting into train, validation, and test sets

In [None]:
# 5.1.1
from sklearn.model_selection import train_test_split
X_train = df_train.drop('Survived', axis=1).values
target_label = df_train['Survived'].values
X_test = df_test.values

X_tr, X_vld, y_tr, y_vld = train_test_split(X_train, target_label, test_size=0.3, random_state=2018)
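
When the only labeled data is the training file, a three-way split can be built from two calls to `train_test_split`. The sketch below uses made-up data; the 60/20/20 proportions and the use of `stratify` (to preserve the class ratio in each part) are assumptions, not choices taken from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made-up): 100 samples, 20% positives
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

# First hold out 20% as a test set, then split the rest into train/validation
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2018)
X_tr_demo, X_vld_demo, y_tr_demo, y_vld_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=2018)

print(len(X_tr_demo), len(X_vld_demo), len(X_hold))
```

The second `test_size` is 0.25 because a quarter of the remaining 80% equals 20% of the full data.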

### 5.2 Model Lookthrough

In [None]:
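from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier)
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

kfold = StratifiedKFold(n_splits=10)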

random_state = 2
classifiers = []
classifiers.append(SVC(random_state=random_state))
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),random_state=random_state,learning_rate=0.1))
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state=random_state))
classifiers.append(MLPClassifier(random_state=random_state))
classifiers.append(KNeighborsClassifier())
classifiers.append(LogisticRegression(random_state = random_state))
classifiers.append(LinearDiscriminantAnalysis())

cv_results = []
for classifier in classifiers:
    cv_results.append(cross_val_score(classifier, X_train, y = target_label, scoring='accuracy', cv = kfold, n_jobs=4))
    
cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())
    
cv_res = pd.DataFrame({"CrossValMeans":cv_means,
                       "CrossValerrors": cv_std,
                       "Algorithm":["SVC","DecisionTree","AdaBoost",
                                    "RandomForest","ExtraTrees","GradientBoosting",
                                    "MultipleLayerPerceptron","KNeighbors","LogisticRegression",
                                    "LinearDiscriminantAnalysis"]})

g = sns.barplot(x="CrossValMeans", y="Algorithm", data = cv_res, palette="Set3",orient = "h",**{'xerr':cv_std})
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")

### 5.3 Hyperparameter Tuning

In [None]:
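from sklearn.model_selection import GridSearchCV

DTC = DecisionTreeClassifier()

adaDTC = AdaBoostClassifier(DTC, random_state=7)

# The nested-parameter prefix is estimator__ in scikit-learn >= 1.2 (base_estimator__ in older versions).
# "SAMME.R" is deprecated in recent scikit-learn releases; remove it if your version rejects it.
ada_param_grid = {"estimator__criterion" : ["gini", "entropy"],
              "estimator__splitter" :   ["best", "random"],
              "algorithm" : ["SAMME","SAMME.R"],
              "n_estimators" :[1,2],
              "learning_rate":  [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3,1.5]}

gsadaDTC = GridSearchCV(adaDTC,param_grid = ada_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)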

gsadaDTC.fit(X_train,target_label)

ada_best = gsadaDTC.best_estimator_

gsadaDTC.best_score_

### 5.4 Plot Learning Curve

In [None]:
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """Generate a simple plot of the test and training learning curve"""
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

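# gsRFC, gsExtC, gsSVMC and gsGBC are assumed to be grid searches fitted the same way as gsadaDTC in 5.3
g = plot_learning_curve(gsRFC.best_estimator_,"RF learning curves",X_train,target_label,cv=kfold)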
g = plot_learning_curve(gsExtC.best_estimator_,"ExtraTrees learning curves",X_train,target_label,cv=kfold)
g = plot_learning_curve(gsSVMC.best_estimator_,"SVC learning curves",X_train,target_label,cv=kfold)
g = plot_learning_curve(gsadaDTC.best_estimator_,"AdaBoost learning curves",X_train,target_label,cv=kfold)
g = plot_learning_curve(gsGBC.best_estimator_,"GradientBoosting learning curves",X_train,target_label,cv=kfold)

### 5.5 Test on Engineered Features
- Impute and scale the polynomial features, then train on them   
  -> Judging by the results, they do not help much...
- Check the domain features  
  -> They do not look very helpful either, but they give good results with boosting models 

In [None]:
poly_features_names = list(app_train_poly.columns)

# impute (SimpleImputer replaces the removed sklearn.preprocessing.Imputer)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median')

poly_features = imputer.fit_transform(app_train_poly)
poly_features_test = imputer.transform(app_test_poly)

# scale
scaler = MinMaxScaler(feature_range=(0,1))

poly_features = scaler.fit_transform(poly_features)
poly_features_test = scaler.transform(poly_features_test)

random_forest_poly = RandomForestClassifier(n_estimators=100,
                                           random_state=50,
                                           verbose=1,
                                           n_jobs=-1)

# train
random_forest_poly.fit(poly_features, train_labels)

# predict
predictions = random_forest_poly.predict_proba(poly_features_test)[:, 1]

## 6. Evaluation and Application

### 6.1 Evaluate Accuracy

In [None]:
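from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_tr, y_tr)
prediction = model.predict(X_vld)

print('Out of {} passengers, survival was predicted with {:.2f}% accuracy'.format(y_vld.shape[0], 100 * metrics.accuracy_score(y_vld, prediction)))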

### 6.2 Feature importance

In [None]:
def plot_feature_importances(df):
    # Sort features according to importance
    df = df.sort_values('importance', ascending = False).reset_index()
    
    # Normalize the feature importances to add up to one
    df['importance_normalized'] = df['importance'] / df['importance'].sum()

    # Make a horizontal bar chart of feature importances
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()
    
    # Need to reverse the index to plot most important on top
    ax.barh(list(reversed(list(df.index[:15]))), 
            df['importance_normalized'].head(15), 
            align = 'center', edgecolor = 'k')
    
    # Set the yticks and labels
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))
    
    # Plot labeling
    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
    plt.show()
    
    return df

# Show the feature importances for the default features
feature_importances_sorted = plot_feature_importances(feature_importances)
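
The function above expects a frame with `feature` and `importance` columns, which this notebook never builds. One way to build it from a fitted tree ensemble is sketched below on made-up data; the names `rf_demo`, `feature_importances_demo`, and the feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data (made-up): only the first feature determines the label
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = (X_demo[:, 0] > 0.5).astype(int)
feature_names = ['f0', 'f1', 'f2']  # hypothetical names

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Table in the shape expected by plot_feature_importances
feature_importances_demo = pd.DataFrame({
    'feature': feature_names,
    'importance': rf_demo.feature_importances_,
})
print(feature_importances_demo.sort_values('importance', ascending=False))
```

Passing such a frame to `plot_feature_importances` then draws the 15 most important features.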

### 6.3 Submission

In [None]:
submission = pd.read_csv('data/gender_submission.csv')

prediction = model.predict(X_test)
submission['Survived'] = prediction

submission.to_csv('data/my_first_submission.csv', index=False)