# **Predicting Customer Churn for the Music Streaming Service KKBOX**


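## **Why This Project Is Needed**
> For a business, retaining an existing customer costs less than acquiring a new revenue-generating customer.
>
> This matters especially for subscription-based services such as music streaming, where retaining existing customers is a key factor not only in current revenue but also in generating future cash flow.
>
> Analyzing churn data also provides evidence to support decisions about improving the current service.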

## **Goals**
> - Stage 1: Predict customer churn with machine learning
> - Stage 2: Build a dashboard
> - Stage 3: Build a system that automatically sends information about customers at high risk of churn via Slack
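
The third stage could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes a Slack incoming webhook (the URL below is a placeholder), a hypothetical list of `(msno, churn_probability)` pairs produced by the model, and an arbitrary threshold of 0.8.

```python
import json
import urllib.request

# Placeholder: replace with a real Slack incoming-webhook URL.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def build_alert(predictions, threshold=0.8):
    """Build a Slack message payload listing customers at or above the threshold.

    `predictions` is a hypothetical list of (msno, churn_probability) pairs.
    """
    at_risk = sorted(
        (p for p in predictions if p[1] >= threshold),
        key=lambda p: p[1],
        reverse=True,
    )
    lines = [f"{msno}: {prob:.2f}" for msno, prob in at_risk]
    text = f"{len(at_risk)} customer(s) at high churn risk\n" + "\n".join(lines)
    return {"text": text}


def send_alert(payload, url=WEBHOOK_URL):
    """POST the payload as JSON to the webhook (requires network access)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Example with made-up identifiers and scores; nothing is sent here.
payload = build_alert([("user_a", 0.91), ("user_b", 0.35), ("user_c", 0.84)])
```

Keeping the payload construction separate from the network call makes the message format easy to check without contacting Slack; `send_alert(payload)` would be called only once a real webhook URL is configured.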


In [29]:
import matplotlib.pyplot as plt
import pandas as pd
from pyspark.sql import SparkSession

In [30]:
spark = SparkSession.builder.appName('predict_churn_rate').getOrCreate()

In [31]:
train_df_v2 = spark.read.option('header', 'true').csv('D:/kkbox-churn-prediction-challenge/data/churn_comp_refresh/train_v2.csv', inferSchema=True)
train_df = spark.read.option('header', 'true').csv('D:/kkbox-churn-prediction-challenge/data/churn_comp_refresh/train.csv', inferSchema=True)
members_df_v3 = spark.read.option('header', 'true').csv('D:/kkbox-churn-prediction-challenge/data/churn_comp_refresh/members_v3.csv', inferSchema=True)
transactions_df_v2 = spark.read.option('header', 'true').csv('D:/kkbox-churn-prediction-challenge/data/churn_comp_refresh/transactions_v2.csv', inferSchema=True)
transactions_df = spark.read.option('header', 'true').csv('D:/kkbox-churn-prediction-challenge/data/churn_comp_refresh/transactions.csv', inferSchema=True)
user_logs_df_v2 = spark.read.option('header', 'true').csv('D:/kkbox-churn-prediction-challenge/data/churn_comp_refresh/user_logs_v2.csv', inferSchema=True)
user_logs_df = spark.read.option('header', 'true').csv('D:/kkbox-churn-prediction-challenge/data/churn_comp_refresh/user_logs.csv', inferSchema=True)

In [33]:
print('train_df : ', train_df.count(), len(train_df.columns))
print('train_df_v2 : ', train_df_v2.count(), len(train_df_v2.columns))
print('members_df_v3 : ', members_df_v3.count(), len(members_df_v3.columns))
print('transactions_df : ', transactions_df.count(), len(transactions_df.columns))
print('transactions_df_v2 : ', transactions_df_v2.count(), len(transactions_df_v2.columns))
print('user_logs_df : ', user_logs_df.count(), len(user_logs_df.columns))
print('user_logs_df_v2 : ', user_logs_df_v2.count(), len(user_logs_df_v2.columns))

train_df :  992931 2
train_df_v2 :  970960 2
members_df_v3 :  6769473 6
transactions_df :  21547746 9
transactions_df_v2 :  1431009 9
user_logs_df :  392106543 9
user_logs_df_v2 :  18396362 9


In [34]:
print(train_df.columns)
print(members_df_v3.columns)
print(transactions_df.columns)
print(user_logs_df.columns)

['msno', 'is_churn']
['msno', 'city', 'bd', 'gender', 'registered_via', 'registration_init_time']
['msno', 'payment_method_id', 'payment_plan_days', 'plan_list_price', 'actual_amount_paid', 'is_auto_renew', 'transaction_date', 'membership_expire_date', 'is_cancel']
['msno', 'date', 'num_25', 'num_50', 'num_75', 'num_985', 'num_100', 'num_unq', 'total_secs']


### Exploring train_df
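
The cells from here on use pandas idioms (`.isnull().sum()`, `.unique()`, `.shape`), and their outputs match the row counts of the v2 files reported above (970960 for train, 1431009 for transactions). They therefore appear to operate on pandas DataFrames rather than on the Spark DataFrames loaded earlier. A minimal sketch of such a load, assuming the same data directory and a name-to-file mapping that is inferred from the outputs rather than stated in the notebook:

```python
# Directory used by the Spark reads above.
DATA_DIR = "D:/kkbox-churn-prediction-challenge/data/churn_comp_refresh"

# Assumed mapping from DataFrame name to file; inferred from row counts.
PANDAS_FILES = {
    "train_df": "train_v2.csv",
    "members_df": "members_v3.csv",
    "transactions_df": "transactions_v2.csv",
    "user_logs_df": "user_logs_v2.csv",
}


def load_with_pandas(data_dir=DATA_DIR, files=PANDAS_FILES):
    """Read each CSV into a pandas DataFrame, keyed by its notebook name."""
    import pandas as pd  # imported here so the mapping is usable without pandas

    return {
        name: pd.read_csv(f"{data_dir}/{filename}")
        for name, filename in files.items()
    }
```

Under these assumptions, `frames = load_with_pandas()` followed by `train_df = frames["train_df"]` would give the objects the following cells expect.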

In [18]:
print((train_df['is_churn'] == 0).sum(), (train_df['is_churn'] == 1).sum())
print((train_df['is_churn'] == 1).sum()/(len(train_df))*100)

883630 87330
8.994191315811156
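
As a quick check on the two outputs above, the churn rate can be recomputed directly from the printed class counts. The snippet is only a sanity check on the numbers shown; the imbalance ratio is an extra derived figure, not something the notebook reports.

```python
# Class counts printed by the cell above.
not_churned = 883630
churned = 87330

total = not_churned + churned
churn_rate_pct = churned / total * 100    # share of customers who churned
imbalance_ratio = not_churned / churned   # non-churners per churner

print(total)                      # 970960, the number of rows
print(round(churn_rate_pct, 2))   # 8.99
print(round(imbalance_ratio, 1))  # 10.1
```

With roughly ten non-churners for every churner, plain accuracy would be a misleading metric for the stage 1 model.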


In [6]:
train_df.isnull().sum()

msno        0
is_churn    0
dtype: int64

In [7]:
len(train_df['msno'].unique())

970960

### Exploring members_df

- members_df has no v2 release, only v3
- v3 was updated on November 13, 2017, whereas the other datasets only contain data through March 31, 2017
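
Because the date columns (`registration_init_time`, `transaction_date`, `date`) are stored as integers in `YYYYMMDD` form, one way to compare them against the 2017-03-31 cutoff is to parse them into `datetime.date` objects. The sketch below is illustrative: the helper names and the idea of flagging members registered after the cutoff are assumptions, not steps taken in the notebook.

```python
from datetime import date, datetime

# Last day covered by the non-member data, per the note above.
CUTOFF = date(2017, 3, 31)


def parse_yyyymmdd(value):
    """Convert an integer such as 20110911 into a datetime.date."""
    return datetime.strptime(str(value), "%Y%m%d").date()


def registered_after_cutoff(registration_init_time, cutoff=CUTOFF):
    """Return True if a member registered after the other tables end."""
    return parse_yyyymmdd(registration_init_time) > cutoff


print(parse_yyyymmdd(20110911))            # 2011-09-11
print(registered_after_cutoff(20110911))   # False
print(registered_after_cutoff(20170401))   # True
```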

In [8]:
members_df.head()

| | msno | city | bd | gender | registered_via | registration_init_time |
|---|---|---|---|---|---|---|
| 0 | Rb9UwLQTrxzBVwCB6+bCcSQWZ9JiNLC9dXtM1oEsZA8= | 1 | 0 | NaN | 11 | 20110911 |
| 1 | +tJonkh+O1CA796Fm5X60UMOtB6POHAwPjbTRVl/EuU= | 1 | 0 | NaN | 7 | 20110914 |
| 2 | cV358ssn7a0f7jZOwGNWS07wCKVqxyiImJUX6xcIwKw= | 1 | 0 | NaN | 11 | 20110915 |
| 3 | 9bzDeJP6sQodK73K5CBlJ6fgIQzPeLnRl0p5B77XP+g= | 1 | 0 | NaN | 11 | 20110915 |
| 4 | WFLY3s7z4EZsieHCt63XrsdtfTEmJ+2PnnKLH5GY4Tk= | 6 | 32 | female | 9 | 20110915 |


In [9]:
members_df.isnull().sum()

msno                            0
city                            0
bd                              0
gender                    4429505
registered_via                  0
registration_init_time          0
dtype: int64

In [10]:
(members_df.bd == 0).sum()

4540215
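
To put the two counts above in proportion, they can be divided by the number of member rows. This sketch assumes `members_df` has the 6769473 rows counted for `members_df_v3` earlier; the percentages are derived here, not reported by the notebook.

```python
# Counts taken from the outputs above.
total_members = 6769473    # row count of members_v3 (assumed to match members_df)
missing_gender = 4429505   # null count for the gender column
bd_zero = 4540215          # rows where bd equals 0

missing_gender_pct = missing_gender / total_members * 100
bd_zero_pct = bd_zero / total_members * 100

print(f"gender missing: {missing_gender_pct:.1f}%")  # about 65.4%
print(f"bd equal to 0:  {bd_zero_pct:.1f}%")         # about 67.1%
```

If `bd == 0` is treated as a placeholder for an unknown age (an assumption), roughly two thirds of the rows carry no usable value in either column.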

In [11]:
members_df.city.unique()

array([ 1,  6,  4,  5, 13, 22, 12, 15, 11,  9, 14,  8, 18, 21,  3,  7, 17,
       10, 20, 16, 19], dtype=int64)

In [12]:
members_df.registered_via.unique()

array([11,  7,  9,  3, 16,  4, 13, 17,  5,  2, 19,  8,  6, 14,  1, 18, 10,
       -1], dtype=int64)

In [13]:
(members_df.registered_via == -1).sum()

1

### Exploring transactions_df

In [14]:
transactions_df.head()

| | msno | payment_method_id | payment_plan_days | plan_list_price | actual_amount_paid | is_auto_renew | transaction_date | membership_expire_date | is_cancel |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134= | 32 | 90 | 298 | 298 | 0 | 20170131 | 20170504 | 0 |
| 1 | ++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA= | 41 | 30 | 149 | 149 | 1 | 20150809 | 20190412 | 0 |
| 2 | +/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc= | 36 | 30 | 180 | 180 | 1 | 20170303 | 20170422 | 0 |
| 3 | +/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk= | 36 | 30 | 180 | 180 | 1 | 20170329 | 20170331 | 1 |
| 4 | +00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c= | 41 | 30 | 99 | 99 | 1 | 20170323 | 20170423 | 0 |


In [23]:
transactions_df.transaction_date.min(), transactions_df.transaction_date.max()

(20150101, 20170331)

In [15]:
transactions_df.shape

(1431009, 9)

In [20]:
len(transactions_df.msno.unique())

1197050

### Exploring user_logs_df

In [16]:
user_logs_df.head()

| | msno | date | num_25 | num_50 | num_75 | num_985 | num_100 | num_unq | total_secs |
|---|---|---|---|---|---|---|---|---|---|
| 0 | u9E91QDTvHLq6NXjEaWv8u4QIqhrHk72kE+w31Gnhdg= | 20170331 | 8 | 4 | 0 | 1 | 21 | 18 | 6309.273 |
| 1 | nTeWW/eOZA/UHKdD5L7DEqKKFTjaAj3ALLPoAWsU8n0= | 20170330 | 2 | 2 | 1 | 0 | 9 | 11 | 2390.699 |
| 2 | 2UqkWXwZbIjs03dHLU9KHJNNEvEkZVzm69f3jCS+uLI= | 20170331 | 52 | 3 | 5 | 3 | 84 | 110 | 23203.337 |
| 3 | ycwLc+m2O0a85jSLALtr941AaZt9ai8Qwlg9n0Nql5U= | 20170331 | 176 | 4 | 2 | 2 | 19 | 191 | 7100.454 |
| 4 | EGcbTofOSOkMmQyN1NMLxHEXJ1yV3t/JdhGwQ9wXjnI= | 20170331 | 2 | 1 | 0 | 1 | 112 | 93 | 28401.558 |


In [17]:
len(user_logs_df.msno.unique())

1103894

In [25]:
user_logs_df.date.min(), user_logs_df.date.max()

(20170301, 20170331)