## We want to see if a user makes a new service subscription transaction within 30 days after their current membership expiration date.

The churn/renewal definition can be tricky due to KKBox's subscription model. Since the majority of KKBox's subscriptions last 30 days, many users re-subscribe every month. The key fields for determining churn/renewal are transaction date, membership expiration date, and is_cancel. The is_cancel field indicates whether a user actively cancels a subscription, but a cancellation does not imply that the user has churned: a user may cancel a subscription to change service plans or for other reasons. **The criterion for "churn" is no new valid service subscription within 30 days after the current membership expires.**
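The criterion above can be sketched in pandas. This is a simplified reading rather than the official labeling code: the helper `label_churn`, the toy rows, and the `current_expiry` mapping are hypothetical, and the sketch assumes a "valid" subscription is any transaction with `is_cancel == 0` dated from the expiration date up to 30 days later, inclusive.

```python
import pandas as pd

def label_churn(user_tx, expire_date, window_days=30):
    """Return 1 (churn) if the user has no non-cancel transaction
    in [expire_date, expire_date + window_days], else 0 (renewal).
    Assumption: window bounds are inclusive on both ends."""
    expire = pd.to_datetime(str(expire_date), format="%Y%m%d")
    deadline = expire + pd.Timedelta(days=window_days)
    tx_date = pd.to_datetime(user_tx["transaction_date"].astype(str), format="%Y%m%d")
    in_window = (tx_date >= expire) & (tx_date <= deadline)
    valid = user_tx["is_cancel"] == 0
    return int(not (in_window & valid).any())

# Made-up transactions using the dataset's column names.
toy = pd.DataFrame({
    "msno": ["u1", "u1", "u2", "u2", "u2", "u3", "u3"],
    "transaction_date": [20170106, 20170210, 20170110, 20170201,
                         20170220, 20170115, 20170401],
    "is_cancel": [0, 0, 0, 1, 0, 0, 0],
})
# Hypothetical "current" expiration date per user.
current_expiry = {"u1": 20170205, "u2": 20170209, "u3": 20170214}

labels = {
    user: label_churn(toy[toy["msno"] == user], expiry)
    for user, expiry in current_expiry.items()
}
print(labels)
```

In the toy data, `u2` cancels before expiry but re-subscribes inside the window, so the sketch labels that user as a renewal, while `u3` only transacts again after the 30-day window and is labeled as churn.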

The train and test data are selected from users whose membership expires within a certain month. The train data consists of users whose subscription expires in February 2017, and the test data consists of users whose subscription expires in March 2017. This means we are looking at user churn or renewal roughly in March 2017 for the train set, and roughly in April 2017 for the test set. Train and test sets are split by transaction date, as are the public and private leaderboard data.
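As a rough illustration of selecting users by expiration month, the sketch below takes each user's most recent transaction as a stand-in for the current membership and compares its `YYYYMM` expiration month. The helper `users_expiring_in` and the toy rows are hypothetical, and the competition's actual selection logic may differ.

```python
import pandas as pd

# Made-up transactions using the dataset's column names.
toy = pd.DataFrame({
    "msno": ["a", "a", "b", "c"],
    "transaction_date": [20161215, 20170115, 20170120, 20170210],
    "membership_expire_date": [20170114, 20170214, 20170319, 20170312],
})

def users_expiring_in(df, yyyymm):
    """Return the sorted user ids whose latest transaction expires in month yyyymm.
    Assumption: the latest transaction describes the current membership."""
    latest = df.sort_values("transaction_date").groupby("msno").tail(1)
    expire_month = latest["membership_expire_date"] // 100  # YYYYMMDD -> YYYYMM
    return sorted(latest.loc[expire_month == yyyymm, "msno"])

feb_users = users_expiring_in(toy, 201702)
mar_users = users_expiring_in(toy, 201703)
print(feb_users, mar_users)
```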

In this dataset, KKBox has included more user behaviors than the ones in the train and test datasets, so that participants can explore user behaviors outside of the train and test sets. For example, a user could actively cancel the subscription, but renew within 30 days.

UPDATE: As of November 6, 2017, we have refreshed the test data to predict user churn in the month of April, 2017.

* msno: user id
* is_churn: This is the target variable. Churn is defined as whether the user did not continue the subscription within 30 days of expiration. __is_churn = 1__ means churn, __is_churn = 0__ means renewal.
* payment_method_id: payment method
* payment_plan_days: length of membership plan in days
* plan_list_price: in New Taiwan Dollar (NTD)
* actual_amount_paid: in New Taiwan Dollar (NTD)
* transaction_date: format %Y%m%d
* membership_expire_date: format %Y%m%d
* is_cancel: whether or not the user canceled the membership in this transaction.
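Since both date fields use the `%Y%m%d` format, they can be parsed into real timestamps before doing date arithmetic. The sketch below uses made-up rows, and the `days_until_expiry` column is a hypothetical name added for illustration.

```python
import pandas as pd

# Made-up rows; the real file stores these dates as integers such as 20170105.
toy = pd.DataFrame({
    "transaction_date": [20170105, 20170228],
    "membership_expire_date": [20170204, 20170330],
})

# Parse the integer YYYYMMDD values into pandas timestamps.
for col in ["transaction_date", "membership_expire_date"]:
    toy[col] = pd.to_datetime(toy[col].astype(str), format="%Y%m%d")

# Days between the transaction and the expiration it sets.
toy["days_until_expiry"] = (
    toy["membership_expire_date"] - toy["transaction_date"]
).dt.days
print(toy)
```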

In [1]:
import numpy as np
import pandas as pd
import dask.dataframe as dd

In [2]:
transactionDf = dd.read_csv('data/transactions.csv')

# transactionDf.compute()

In [3]:
# meta = ('membership_expire_date', pd.Timestamp)
# Dates are integers in YYYYMMDD form, so integer division by 100
# drops the day and leaves the month as YYYYMM.
def expireMonth(df):
    return df['membership_expire_date'] // 100

def transMonth(df):
    return df['transaction_date'] // 100


# transactionDf = transactionDf.assign(membership_expire_month=transactionDf.map_partitions(parseDT,meta=pd.Timestamp ))
transactionDf['transaction_month'] = transMonth(transactionDf)
transactionDf['membership_expire_month'] = expireMonth(transactionDf)
# transactionDf.compute()

In [4]:
## For each user, group by month and is_cancel
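One possible shape for that grouping, shown on a small pandas frame with made-up rows. The `n_transactions` column name is hypothetical, and whether counts are the aggregate the notebook ultimately needs is an assumption.

```python
import pandas as pd

# Made-up rows with the columns derived in the previous cell.
toy = pd.DataFrame({
    "msno": ["u1", "u1", "u1", "u2"],
    "transaction_month": [201701, 201701, 201702, 201701],
    "is_cancel": [0, 1, 0, 0],
})

# Count transactions per user, month, and cancellation flag.
counts = (
    toy.groupby(["msno", "transaction_month", "is_cancel"])
       .size()
       .rename("n_transactions")
       .reset_index()
)
print(counts)
```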

In [19]:
userDf = dd.read_csv('data/user_logs_v2.csv')
userDf['month'] = (userDf['date']/100).astype(int)
# userDf.head()

In [7]:
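# Note: sample_submission_v2.csv lists the msno values to predict for,
# so despite its name this frame holds the test-set users.
trainDf = pd.read_csv('data/sample_submission_v2.csv')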
uniqueUsers = trainDf.msno
uniqueUsers
# uniqueUsers.compute()

0         4n+fXlyJvfQnTeKXTWT507Ll4JVYGrOC8LHCfwBmPE4=
1         aNmbC1GvFUxQyQUidCVmfbQ0YeCuwkPzEdQ0RwWyeZM=
2         rFC9eSG/tMuzpre6cwcMLZHEYM89xY02qcz7HL4//jc=
3         WZ59dLyrQcE7ft06MZ5dj40BnlYQY7PHgg/54+HaCSE=
4         aky/Iv8hMp1/V/yQHLtaVuEmmAxkB5GuasQZePJ7NU4=
5         nu1jZ/flvSaXRg0U9Es+xmo2KlAXq/q+mhcWFStwm9w=
6         biDTtgK83fEWXJDaxzD1eh22dion/h3odeIbS7qJMmY=
7         k/8uwi/iM9LZmRAIWXLqpZY6ENomXAscwsQsh6PxcTw=
8         18rQ/746SjA6nBx325UsyhfsDhu4tK01FXFxHWZjw20=
9         2V13OCoWx6vqKr/ZzNmKFrmnC2FtR4SWMz5C5Hi02PY=
10        1l/ZwduFxS/q/hZeyssAYH27espkp8Yw6uAnUxfEbTI=
11        azfnO16ZeQsbJF6LcqkQhbA3NWiqHYWqaq7AFjsJVaQ=
12        RPOzeEr8mSbhj6wrF29+7KciuiNrj7IvkzxJ9rgCTks=
13        NAzfjSM2EOyFhV4rIm/RO9pXCbyti6scBfcmV/t+CaU=
14        1DCd06ON0rWFHI1bNrY1l/hPW9d80fmmrmroHqpGvNA=
15        D9QAV8ZNF8qU96dTBLMzO0sguzlmAIBf4302l0W6jj0=
16        5HKzLDUVVbIxWMH9aH67ALAGVPvorE4NvmO5xqO7SMk=
17        XwnlNj6nq2MMHe0KoyRRM4ih+RAwj5idHvlS4pTMTbg=
18        

In [20]:
# Average every log column per user and month
groupedUser = userDf.groupby(['msno', 'month']).aggregate('mean')

# compute() returns a pandas DataFrame
groupedUser = groupedUser.compute()
# groupedUser.head()

In [21]:
groupedUser.to_csv('data/user_grouped_v2.csv')
# type(groupedUser)