## Machine Learning Modeling Process
> 1. Data collection
> 2. Data cleaning
> 3. Choosing an ML algorithm
> 4. Evaluating the model
> 5. Improving the model

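## Model Training and Evaluation
- Numeric prediction problems:
How close is the prediction to the actual value?
- Category prediction problems:
How many were predicted correctly? The criterion is whether the actual value and the predicted value match.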
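
As a minimal sketch of these two criteria, the snippet below computes the mean absolute error and root mean squared error for a numeric prediction, and the accuracy for a category prediction. The arrays are made-up illustrative values, not results from this notebook.

```python
import numpy as np

# Made-up values for illustration only.
y_true_num = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_num = np.array([2.5, 5.0, 4.0, 8.0])

# Numeric prediction: how close are the predictions to the actual values?
mae = np.mean(np.abs(y_true_num - y_pred_num))           # mean absolute error
rmse = np.sqrt(np.mean((y_true_num - y_pred_num) ** 2))  # root mean squared error

y_true_cat = np.array([1, 0, 1, 1, 0])
y_pred_cat = np.array([1, 0, 0, 1, 1])

# Category prediction: what fraction of predictions match exactly?
accuracy = np.mean(y_true_cat == y_pred_cat)

print(mae, rmse, accuracy)
```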

### Random Forest
- An ensemble algorithm that combines the predictions of many different decision trees
- Hyperparameter: how many trees should be built?

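### Support Vector Machine
- An algorithm that finds the hyperplane that separates two classes well while maximizing the margin
- Hyperparameter: what value should the C parameter be set to?

### ANN
- An algorithm that aims to learn optimal weights using the hidden layers that sit between the input layer and the output layer
- Hyperparameter: how many hidden neurons should the model have?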
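
The three hyperparameters named above correspond to constructor arguments in scikit-learn. Using scikit-learn is an assumption here, since these notes do not name a modeling library, and the values below are placeholders rather than tuned settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Placeholder values for illustration; in practice they are tuned on validation data.
models = {
    # n_estimators: number of trees in the forest
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=123),
    # C: the C parameter of the support vector machine
    "svm": SVC(C=1.0),
    # hidden_layer_sizes: one hidden layer with 16 neurons
    "ann": MLPClassifier(hidden_layer_sizes=(16,), random_state=123),
}

print(models["random_forest"].n_estimators)
print(models["svm"].C)
print(models["ann"].hidden_layer_sizes)
```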

# Hands-on: Customer Churn Prediction with Machine Learning
- Churn prediction: predicting which customers will leave the company

### Data Preprocessing
- Data merging (SQL-style inner join, outer join) with pd.merge
- Data balancing (*undersampling*, oversampling)
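
This notebook uses undersampling. For contrast, random oversampling can be sketched as follows: the minority class is resampled with replacement until it matches the majority class. The toy DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 6 rows of class 0 and 2 rows of class 1.
df = pd.DataFrame({"x": range(8), "label": [0, 0, 0, 0, 0, 0, 1, 1]})

# value_counts() sorts by count in descending order.
counts = df["label"].value_counts()
major_label, minor_label = counts.index[0], counts.index[-1]

major = df[df["label"] == major_label]
minor = df[df["label"] == minor_label]

# Oversampling: draw minority rows with replacement up to the majority size.
minor_over = minor.sample(n=len(major), replace=True, random_state=123)

balanced = pd.concat([major, minor_over], axis=0)
print(balanced["label"].value_counts())
```

Because oversampling duplicates rows, it is usually applied after the train/test split so that copies of one row do not end up on both sides.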

In [1]:
import os
import numpy as np
import pandas as pd

## 1.1 Data Preparation
- Load the data
- Merge
- One-hot encoding
- Remove missing values
- Data balancing (undersampling)
- Data split (train/validation/test)

In [4]:
# Load the data (customer data / transaction data)
customer_df = pd.read_csv("churning_customers.csv", encoding='cp949')

customer_df.head()
customer_df.shape

(20000, 9)

In [5]:
# Drop the start-date, payment-method, and handset columns
customer_df.drop(['개시일','지불방법','핸드셋'], axis=1, inplace=True)

In [6]:
customer_df.head()

Unnamed: 0,고객ID,성별,연령,서비스기간,단선횟수,요금제
0,K100010,남,46,15.066667,1,CAT50
1,K100020,남,27,37.333333,0,CAT50
2,K100030,남,39,50.366667,2,CAT50
3,K100040,남,28,26.2,2,CAT50
4,K100050,남,47,26.433333,0,CAT50


In [8]:
transaction_df = pd.read_csv("churning_transactions.csv",
                            encoding='cp949')
transaction_df.head()

Unnamed: 0,고객ID,주간통화횟수,주간통화시간_분,야간통화횟수,야간통화시간_분,주말통화횟수,주말통화시간_분,국제통화시간_분,국내통화요금_분,평균주간통화시간,평균야간통화시간,평균주말통화시간,국내통화횟수,국내통화시간_분,평균국내통화시간,총통화시간_분,이탈여부
0,K100010,14,36.131353,10,7.973121,24,14.533282,1.443889,0.0,2.580811,0.797312,0.605553,48,58.637756,1.22162,60.081645,이탈
1,K100020,54,39.437279,34,21.152722,0,0.0,9.779366,0.0,0.73032,0.622139,0.0,88,60.590001,0.688523,70.369367,이탈
2,K100030,44,72.6,1,27.6,22,37.200001,16.601092,0.0,1.65,27.6,1.690909,67,137.400001,2.050746,154.001093,이탈
3,K100040,44,72.6,1,27.6,22,37.2,16.601076,0.0,1.65,27.6,1.690909,67,137.4,2.050746,154.001076,이탈
4,K100050,32,40.608449,14,18.823708,1,1.233764,4.473546,0.0,1.269014,1.344551,1.233764,47,60.665921,1.290764,65.139467,이탈


In [9]:
transaction_df.shape

(20000, 17)

In [11]:
# Encode the target label: churned ("이탈") -> 1, retained ("유지") -> 0
transaction_df['이탈여부'] = transaction_df['이탈여부'].map({"이탈":1, "유지":0})

In [12]:
transaction_df.head()

Unnamed: 0,고객ID,주간통화횟수,주간통화시간_분,야간통화횟수,야간통화시간_분,주말통화횟수,주말통화시간_분,국제통화시간_분,국내통화요금_분,평균주간통화시간,평균야간통화시간,평균주말통화시간,국내통화횟수,국내통화시간_분,평균국내통화시간,총통화시간_분,이탈여부
0,K100010,14,36.131353,10,7.973121,24,14.533282,1.443889,0.0,2.580811,0.797312,0.605553,48,58.637756,1.22162,60.081645,1
1,K100020,54,39.437279,34,21.152722,0,0.0,9.779366,0.0,0.73032,0.622139,0.0,88,60.590001,0.688523,70.369367,1
2,K100030,44,72.6,1,27.6,22,37.200001,16.601092,0.0,1.65,27.6,1.690909,67,137.400001,2.050746,154.001093,1
3,K100040,44,72.6,1,27.6,22,37.2,16.601076,0.0,1.65,27.6,1.690909,67,137.4,2.050746,154.001076,1
4,K100050,32,40.608449,14,18.823708,1,1.233764,4.473546,0.0,1.269014,1.344551,1.233764,47,60.665921,1.290764,65.139467,1


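## 1.2 Data Merging
- pd.merge(): used when the tables share a common key
- pd.concat(): stacks DataFrames along an axis (rows or columns); no key column is needed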
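
The difference between the two functions, and between inner and outer joins, can be seen on a small example. The toy tables and the key column name id are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables sharing the key column "id".
left = pd.DataFrame({"id": ["a", "b", "c"], "age": [30, 41, 25]})
right = pd.DataFrame({"id": ["b", "c", "d"], "calls": [12, 7, 3]})

# Inner join keeps keys present in both tables: b, c.
inner = pd.merge(left, right, how="inner", on="id")

# Outer join keeps keys present in either table: a, b, c, d (gaps become NaN).
outer = pd.merge(left, right, how="outer", on="id")

# concat stacks frames along an axis without matching on a key.
stacked = pd.concat([left, left], axis=0, ignore_index=True)

print(inner.shape, outer.shape, stacked.shape)
```

In this notebook the inner merge on 고객ID keeps all 20,000 rows, which suggests that every customer ID appears in both files.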

In [13]:
# Inner join of the two tables on the shared customer ID key
churn_df = pd.merge(customer_df, transaction_df, how='inner', on='고객ID')

In [14]:
churn_df.head()

Unnamed: 0,고객ID,성별,연령,서비스기간,단선횟수,요금제,주간통화횟수,주간통화시간_분,야간통화횟수,야간통화시간_분,...,국제통화시간_분,국내통화요금_분,평균주간통화시간,평균야간통화시간,평균주말통화시간,국내통화횟수,국내통화시간_분,평균국내통화시간,총통화시간_분,이탈여부
0,K100010,남,46,15.066667,1,CAT50,14,36.131353,10,7.973121,...,1.443889,0.0,2.580811,0.797312,0.605553,48,58.637756,1.22162,60.081645,1
1,K100020,남,27,37.333333,0,CAT50,54,39.437279,34,21.152722,...,9.779366,0.0,0.73032,0.622139,0.0,88,60.590001,0.688523,70.369367,1
2,K100030,남,39,50.366667,2,CAT50,44,72.6,1,27.6,...,16.601092,0.0,1.65,27.6,1.690909,67,137.400001,2.050746,154.001093,1
3,K100040,남,28,26.2,2,CAT50,44,72.6,1,27.6,...,16.601076,0.0,1.65,27.6,1.690909,67,137.4,2.050746,154.001076,1
4,K100050,남,47,26.433333,0,CAT50,32,40.608449,14,18.823708,...,4.473546,0.0,1.269014,1.344551,1.233764,47,60.665921,1.290764,65.139467,1


In [15]:
churn_df.shape

(20000, 22)

In [16]:
# The customer ID is only needed as the merge key, so drop it
churn_df.drop(['고객ID'], axis=1, inplace=True)

In [17]:
churn_df.shape

(20000, 21)

## 1.3 One-hot encoding

In [18]:
# Expand the categorical columns (성별, 요금제) into 0/1 indicator columns
churn_df = pd.get_dummies(churn_df)

In [19]:
churn_df.head(2)

Unnamed: 0,연령,서비스기간,단선횟수,주간통화횟수,주간통화시간_분,야간통화횟수,야간통화시간_분,주말통화횟수,주말통화시간_분,국제통화시간_분,...,평균국내통화시간,총통화시간_분,이탈여부,성별_남,성별_여,요금제_CAT100,요금제_CAT200,요금제_CAT50,요금제_Play100,요금제_Play300
0,46,15.066667,1,14,36.131353,10,7.973121,24,14.533282,1.443889,...,1.22162,60.081645,1,1,0,0,0,1,0,0
1,27,37.333333,0,54,39.437279,34,21.152722,0,0.0,9.779366,...,0.688523,70.369367,1,1,0,0,0,1,0,0
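
What pd.get_dummies does can be seen on a small example. The toy DataFrame below is hypothetical. Note that recent pandas versions return boolean indicator columns by default, so the output may show True/False rather than the 1/0 seen above; passing dtype=int makes the 0/1 encoding explicit.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
df = pd.DataFrame({"age": [46, 27, 39], "plan": ["CAT50", "Play100", "CAT50"]})

# String columns are expanded into one indicator column per category;
# numeric columns pass through unchanged.
encoded = pd.get_dummies(df, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```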


In [22]:
# Move the target column (이탈여부) to the end of the column list
cols = list(churn_df.columns)
cols.remove('이탈여부')
cols.append('이탈여부')
cols

['연령',
 '서비스기간',
 '단선횟수',
 '주간통화횟수',
 '주간통화시간_분',
 '야간통화횟수',
 '야간통화시간_분',
 '주말통화횟수',
 '주말통화시간_분',
 '국제통화시간_분',
 '국내통화요금_분',
 '평균주간통화시간',
 '평균야간통화시간',
 '평균주말통화시간',
 '국내통화횟수',
 '국내통화시간_분',
 '평균국내통화시간',
 '총통화시간_분',
 '성별_남',
 '성별_여',
 '요금제_CAT100',
 '요금제_CAT200',
 '요금제_CAT50',
 '요금제_Play100',
 '요금제_Play300',
 '이탈여부']

In [23]:
churn_df = churn_df[cols]

## 1.4 Finding & Removing Missing Values

In [25]:
churn_df.isnull().sum()

연령             0
서비스기간          0
단선횟수           0
주간통화횟수         0
주간통화시간_분       0
야간통화횟수         0
야간통화시간_분       0
주말통화횟수         0
주말통화시간_분       0
국제통화시간_분       0
국내통화요금_분       0
평균주간통화시간       0
평균야간통화시간       0
평균주말통화시간       0
국내통화횟수         0
국내통화시간_분       0
평균국내통화시간       0
총통화시간_분        0
성별_남           0
성별_여           0
요금제_CAT100     0
요금제_CAT200     0
요금제_CAT50      0
요금제_Play100    0
요금제_Play300    0
이탈여부           0
dtype: int64
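
Every column reports zero missing values, so nothing needs to be removed from this dataset. If some counts were nonzero, the removal step could look like the following sketch on hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries.
df = pd.DataFrame({"age": [46, np.nan, 39], "minutes": [15.1, 37.3, np.nan]})

# Count missing values per column.
print(df.isnull().sum())

# Drop every row that has at least one missing value.
cleaned = df.dropna(axis=0)
print(cleaned.shape)
```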

## 1.5 Balancing
- Undersampling

In [26]:
churn_df['이탈여부'].value_counts()

0    10069
1     9931
Name: 이탈여부, dtype: int64

In [27]:
# value_counts() sorts counts in descending order, so the majority count comes first
major_class, minor_class = churn_df['이탈여부'].value_counts()

In [28]:
# Class 0 (retained) is the majority here; class 1 (churned) is the minority
major_data = churn_df[churn_df['이탈여부'] == 0]
minor_data = churn_df[churn_df['이탈여부'] == 1]

In [29]:
# Randomly sample the majority class down to the minority-class size
under_data = major_data.sample(n=minor_class, random_state=123)
under_data.shape

(9931, 26)

In [30]:
# Stack the undersampled majority rows and the minority rows
balanced_data = pd.concat([under_data, minor_data], axis=0)

In [31]:
balanced_data['이탈여부'].value_counts()

0    9931
1    9931
Name: 이탈여부, dtype: int64
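
The cells above hard-code class 0 as the majority. A hedged generalization is sketched below: a helper, hypothetically named undersample, that reads the majority and minority labels from the class counts. It assumes a binary target and is shown on toy data.

```python
import pandas as pd

def undersample(df: pd.DataFrame, target: str, seed: int = 123) -> pd.DataFrame:
    """Randomly drop majority-class rows until both classes have equal size.

    Hypothetical helper; assumes `target` holds exactly two classes.
    """
    # value_counts() sorts by count in descending order.
    counts = df[target].value_counts()
    major_label, minor_label = counts.index[0], counts.index[-1]
    major = df[df[target] == major_label]
    minor = df[df[target] == minor_label]
    # Sample without replacement so no majority row is duplicated.
    major_under = major.sample(n=len(minor), random_state=seed)
    return pd.concat([major_under, minor], axis=0)

# Toy check: 5 rows of class 1 (majority) and 3 rows of class 0.
toy = pd.DataFrame({"x": range(8), "y": [1, 1, 1, 1, 1, 0, 0, 0]})
balanced_toy = undersample(toy, "y")
print(balanced_toy["y"].value_counts())
```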

## 1.6 Data Split

In [None]:
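# Split into train, validation, and test data
# Fix the seed so the split is reproducible (same seed as the undersampling step)
train_data = balanced_data.sample(frac=0.8, random_state=123)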
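
The cell above draws 80% of the rows as training data. One way to carve validation and test sets out of the remainder is to drop the indices that were already sampled, as sketched below on hypothetical toy data. The 80/10/10 ratio is an assumption; the notebook states only the 0.8 training fraction.

```python
import pandas as pd

# Hypothetical stand-in for balanced_data, with a unique index.
data = pd.DataFrame({"x": range(100), "y": [0, 1] * 50})

# 80% train, then split the remaining 20% evenly into validation and test.
train_data = data.sample(frac=0.8, random_state=123)
rest = data.drop(train_data.index)
val_data = rest.sample(frac=0.5, random_state=123)
test_data = rest.drop(val_data.index)

print(len(train_data), len(val_data), len(test_data))
```

Dropping by index relies on the index labels being unique. That holds for balanced_data, because the undersampling step draws without replacement.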