<font color = "#CC3D3D">
# Machine Learning Process #
<br>
<img align="left" src="http://www.kdnuggets.com/wp-content/uploads/crisp-dm-4-problems-fig1.png" alt="CRISP-DM">

## Step 1: Business Understanding ##

1. Business Objectives
 - Develop a new personal pension product (PEP: Personal Equity Plan) and open as many accounts as possible among existing customers
2. Analytics Goals
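 - Develop a PEP subscription prediction model
 - Develop customer profiles
 - Improve the efficiency of direct-mail advertising
 - Raise the response rate through targeted mailing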

## Step 2: Data Understanding ##
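1. Data acquisition procedure
 - Extract a sample list of customers from the existing customer DB for a test mailing
 - Send a mail offering the new financial product (PEP)
 - Record each customer's response
2. Data for analysis
 - 600 training records
 - 200 new-customer records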

In [11]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Collect Initial Data ###

##### for modeling

##### for deployment
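The two loading cells are empty here, so `df` (modeling data) and `new` (deployment data) are never defined. A minimal sketch of how they might be loaded with `pd.read_csv`, assuming the data is stored as CSV; the in-memory text and its rows are made-up stand-ins, and the real file paths are not given in this notebook.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the real files: made-up rows that only borrow
# column names used later in this notebook.
modeling_csv = io.StringIO(
    "id,age,income,children,pep\n"
    "ID1,48,17546.0,1,1\n"
    "ID2,40,30085.1,3,0\n"
)
deployment_csv = io.StringIO(
    "id,age,income,children\n"
    "ID601,23,18766.9,0\n"
)

# In practice, pass the path of the actual file instead of the StringIO object.
df = pd.read_csv(modeling_csv)     # labeled data used for modeling (has 'pep')
new = pd.read_csv(deployment_csv)  # unlabeled data used for deployment (no 'pep')
print(df.shape, new.shape)
```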

### Describe Data ###

In [9]:
df.info()


In [None]:
df.head(10)

In [None]:
df.describe()

In [None]:
df.hist(bins=30, figsize=(20,15))

In [None]:
print(new.shape)
new.tail()

### Explore Data ###

##### Look for Correlations #####

In [None]:
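# numeric_only=True skips non-numeric columns (for example a string 'id'),
# which recent pandas versions no longer drop silently.
corr = df.corr(numeric_only=True)
corr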

In [None]:
plt.matshow(corr)

In [None]:
import seaborn as sn
sn.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)
plt.rcdefaults()

In [None]:
df.corr(numeric_only=True)['pep'].sort_values(ascending=False)

##### Detect Outliers #####
<img align="left" src="http://www.whatissixsigma.net/wp-content/uploads/2015/07/Box-Plot-Diagram-to-identify-Outliers-figure-1.png" alt="Boxplot Outlier">

In [None]:
df.loc[:,['age','income']].plot.box(subplots=True, layout=(2,1), figsize=(5,5))
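The box plots above mark points beyond the whiskers as outliers. A minimal sketch of the common 1.5 × IQR rule, assuming that is the rule the figure illustrates; the helper `iqr_outliers` and the sample ages are hypothetical and not part of the original notebook.

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Made-up ages; 120 is far above the upper fence.
ages = pd.Series([25, 30, 35, 40, 45, 120])
mask = iqr_outliers(ages)
print(ages[mask])
```

The same mask could be applied to `df['age']` or `df['income']` to list the rows that the box plot draws as individual points.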

## Step 3: Data Preparation ##

### Clean Data ###
##### Replace Missing Values #####

In [None]:
mdf = df.copy()

In [None]:
mdf.age.mean()

In [8]:
mdf.age.value_counts()


In [None]:
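# Fill missing ages with the rounded mean and assign the result back to the column.
# fillna(..., inplace=True) controls whether the fill is stored in the object itself, but calling it
# on mdf.age is chained assignment and may leave mdf unchanged in recent pandas versions.
mdf['age'] = mdf['age'].fillna(round(mdf['age'].mean(), 0))
mdf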

### Construct Data ###
##### Derive Attributes #####

In [None]:
mdf['realincome'] = np.where(mdf['children']==0, mdf['income'], mdf['income']/mdf['children'])
mdf.head()

<font color = "blue">
***[numpy.where(condition, x, y)](https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html#numpy.where)***<br>
*return elements, either from x or y, depending on condition*
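As a small illustration of how `np.where` builds `realincome`, here is a sketch on a made-up three-row frame (the values are hypothetical). Both branches are evaluated for every row, and the condition decides which result is kept.

```python
import numpy as np
import pandas as pd

# Made-up rows: one customer without children, two with children.
toy = pd.DataFrame({"income": [30000.0, 24000.0, 18000.0],
                    "children": [0, 2, 3]})

# Keep income as-is when there are no children, otherwise divide it
# by the number of children.
toy["realincome"] = np.where(toy["children"] == 0,
                             toy["income"],
                             toy["income"] / toy["children"])
print(toy)
```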

### Select Data ###
##### Filter Attributes  #####

In [None]:
columns = ['income', 'children', 'current_act', 'car', 'mortgage', 'region']
mdf = mdf.drop(columns, axis=1)
mdf.head()

### Split Data ###
<img align="left" src="https://www.developer.com/imagesvr_ce/6793/ML4.png" width=500 height=500 alt="Train/test split">

In [None]:
from sklearn.model_selection import train_test_split  # for Hold-out validation

In [None]:
dfX = mdf.drop(['id','pep'], axis=1)  # exclude 'id' attribute & class variable
dfy = mdf['pep']                      # class variable
X_train, X_test, y_train, y_test = train_test_split(dfX, dfy, test_size=0.25, random_state=0)

In [None]:
print(X_train.shape, X_test.shape)                    

In [None]:
X_train.head()
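With 600 training records and `test_size=0.25`, the hold-out split should leave 450 rows for training and 150 for testing. A minimal sketch on synthetic data that only checks the shapes; the frame `toy_X` and the labels `toy_y` are randomly generated placeholders, not the PEP data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 600 synthetic rows standing in for the modeling data.
toy_X = pd.DataFrame({
    "age": rng.integers(18, 68, size=600),
    "realincome": rng.uniform(5000, 60000, size=600),
})
toy_y = pd.Series(rng.integers(0, 2, size=600), name="pep")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0
)
print(X_tr.shape, X_te.shape)
```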

## Step 4: Modeling ##

### Build Model ###

##### Decision Trees #####

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree = DecisionTreeClassifier(max_depth=6, random_state=0)

In [None]:
tree.fit(X_train, y_train)

In [None]:
pred_tree = tree.predict(X_test)
pred_tree

##### Logistic Regression #####

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lreg = LogisticRegression(random_state=0)

In [None]:
lreg.fit(X_train, y_train)

### Assess Model ###

##### Decision Trees #####

In [None]:
tree.score(X_train, y_train)

In [None]:
tree.score(X_test, y_test)

##### Logistic Regression #####

In [None]:
lreg.score(X_test, y_test)
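For scikit-learn classifiers, `score` returns the mean accuracy, that is, the share of samples whose predicted class matches the true class. A small sketch on made-up, perfectly separable data that compares it with a manual computation and with `accuracy_score`; the arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Made-up one-feature data: class 0 for small values, class 1 for large ones.
X_toy = np.array([[0], [1], [2], [3], [4], [5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)

pred = clf.predict(X_toy)
manual_acc = (pred == y_toy).mean()       # fraction of correct predictions
score_acc = clf.score(X_toy, y_toy)       # same quantity via score()
metric_acc = accuracy_score(y_toy, pred)  # same quantity via sklearn.metrics
print(manual_acc, score_acc, metric_acc)
```

Comparing the training score with the test score, as the decision-tree cells above do, gives a rough indication of overfitting when the gap is large.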

## Step 5: Evaluation ##

<font color = "red">
- *Which model is the best?*
- *Is the model useful?*
</font>

In [None]:
best_model = tree   # Change this line if the best model is not the decision tree.
best_model.score(X_test, y_test)

In [None]:
from sklearn.dummy import DummyClassifier
print(y_test.value_counts())
DummyClassifier(strategy='most_frequent').fit(X_train, y_train).score(X_test, y_test)
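The `DummyClassifier` cell gives a baseline for judging whether the model is useful: with `strategy='most_frequent'` it always predicts the majority class of the training labels, so its test accuracy equals the share of that class in the test set. A minimal sketch with made-up labels (the arrays are hypothetical):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Features are ignored by the dummy model, so zeros are enough here.
X_tr_toy = np.zeros((4, 1))
y_tr_toy = np.array([0, 0, 0, 1])   # majority class in training: 0
X_te_toy = np.zeros((4, 1))
y_te_toy = np.array([0, 1, 0, 0])   # three of four test labels are 0

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr_toy, y_tr_toy)
baseline_pred = baseline.predict(X_te_toy)
baseline_acc = baseline.score(X_te_toy, y_te_toy)
print(baseline_pred, baseline_acc)
```

A candidate model is only worth deploying if its test accuracy is clearly above this baseline.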

## Step 6: Deployment ##

In [None]:
# Apply the same preprocessing that was used for the modeling data.
ndf = new.copy()
# If a customer has no children, use income as the real income;
# otherwise divide income by the number of children.
ndf['realincome'] = np.where(ndf['children']==0, ndf['income'], ndf['income']/ndf['children'])
ndf = ndf.drop(columns, axis=1)
ndf.head()

### Apply the best model to select target customers ###

In [None]:
ndf['pred'] = best_model.predict(ndf.loc[:,'age':'realincome'])

In [None]:
print(best_model.predict_proba(ndf.loc[:,'age':'realincome']))
ndf['pred_prob'] = best_model.predict_proba(ndf.loc[:,'age':'realincome'])[:,1]

In [None]:
ndf.head()

In [None]:
target = ndf.query('pred == 1 and pred_prob > 0.7')  # keep only customers whose predicted probability of joining PEP exceeds 70%
target.sort_values(by="pred_prob", ascending=False).to_csv("./pep_target.csv", index=False)
pd.read_csv("./pep_target.csv").tail()
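The selection step can be checked on a tiny made-up frame. This sketch assumes the same column names, `pred` and `pred_prob`, and uses hypothetical customer ids and probabilities.

```python
import pandas as pd

# Made-up predictions for four customers.
toy_pred = pd.DataFrame({
    "id": ["A", "B", "C", "D"],
    "pred": [1, 1, 0, 1],
    "pred_prob": [0.92, 0.65, 0.10, 0.71],
})

# Keep predicted subscribers with probability above 0.7, highest first.
picked = (toy_pred.query("pred == 1 and pred_prob > 0.7")
                  .sort_values(by="pred_prob", ascending=False))
print(picked)
```

For a binary classifier whose `predict` picks the more probable class, the `pred == 1` condition is already implied by `pred_prob > 0.7`; keeping it makes the intent explicit.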