# 1.4 Generate Cutoff Times

##### Description

FeatureTools requires a table of cutoff times, keyed by customer, in order to extract features. We use the Python functions from the FeatureTools churn example, found on [GitHub](https://github.com/Featuretools/predict-customer-churn/blob/master/churn/2.%20Prediction%20Engineering.ipynb). These functions could be converted to UDFs to run on the Spark cluster, but for simplicity they are executed locally and the output is transferred over manually.

##### Notebook Steps

1. Input Data
1. Run label generation
1. Output Data

## 1. Input Data

In [3]:
import pandas as pd
import numpy as np
import datetime

df = pd.read_csv('../../data/raw/transactions_v2.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1431009 entries, 0 to 1431008
Data columns (total 9 columns):
 #   Column                  Non-Null Count    Dtype 
---  ------                  --------------    ----- 
 0   msno                    1431009 non-null  object
 1   payment_method_id       1431009 non-null  int64 
 2   payment_plan_days       1431009 non-null  int64 
 3   plan_list_price         1431009 non-null  int64 
 4   actual_amount_paid      1431009 non-null  int64 
 5   is_auto_renew           1431009 non-null  int64 
 6   transaction_date        1431009 non-null  int64 
 7   membership_expire_date  1431009 non-null  int64 
 8   is_cancel               1431009 non-null  int64 
dtypes: int64(8), object(1)
memory usage: 98.3+ MB


In [4]:
df['transaction_date'] = pd.to_datetime(df['transaction_date'], format='%Y%m%d')
df['membership_expire_date'] = pd.to_datetime(df['membership_expire_date'], format='%Y%m%d')
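
The conversion above relies on the raw columns holding dates as integers in `YYYYMMDD` form. A minimal sketch of the same parsing on made-up values, assuming that encoding; the explicit cast to string is an extra safety step and is not part of the cell above:

```python
import pandas as pd

# Toy frame mimicking the integer-encoded date column (values are illustrative)
toy = pd.DataFrame({"transaction_date": [20160229, 20170301]})

# Cast to string first so each value is parsed as YYYYMMDD text
toy["transaction_date"] = pd.to_datetime(
    toy["transaction_date"].astype(str), format="%Y%m%d"
)

print(toy.dtypes)
print(toy)
```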

## 2. Run Label Generation

##### NOTE: These functions were copied from the FeatureTools GitHub repository

In [8]:
def label_customer(customer_id, customer_transactions, prediction_date, churn_days, 
                   lead_time = 1, prediction_window = 1, return_trans = False):
    """
    Make label times for a single customer. Returns a dataframe of labels with times, the binary label, 
    and the number of days until the next churn.
       
    Params
    --------
        customer_id (str): unique id for the customer
        customer_transactions (dataframe): transactions dataframe for the customer
        prediction_date (str): time at which predictions are made. Either "MS" for the first of the month
                               or "SMS" for the first and fifteenth of each month 
        churn_days (int): integer number of days without an active membership required for a churn. A churn is
                          defined by exceeding this number of days without an active membership.
        lead_time (int): number of periods in advance to make predictions for. Defaults to 1 (predictions for one offset)
        prediction_window(int): number of periods over which to consider churn. Defaults to 1.
        return_trans (boolean): whether or not to return the transactions for analysis. Defaults to False.
        
    Return
    --------
        label_times (dataframe): a table of customer id, the cutoff times at the specified frequency, the 
                                 label for each cutoff time, the number of days until the next churn for each
                                 cutoff time, and the date on which the churn itself occurred.
        transactions (dataframe): [optional] dataframe of customer transactions if return_trans = True. Useful
                                  for making sure that the function performed as expected
    
       """
    
    assert(prediction_date in ['MS', 'SMS']), "Prediction day must be either 'MS' or 'SMS'"
    # Compare as plain lists so the check fails cleanly when several customers are present
    assert customer_transactions['msno'].unique().tolist() == [customer_id], "Transactions must be for only one customer"
    
    # Don't modify original
    transactions = customer_transactions.copy()
    
    # Make sure to sort chronologically
    transactions.sort_values(['transaction_date', 'membership_expire_date'], inplace = True)
    
    # Create next transaction date by shifting back one transaction
    transactions['next_transaction_date'] = transactions['transaction_date'].shift(-1)
    
    # Find number of days between membership expiration and next transaction
    transactions['difference_days'] = (transactions['next_transaction_date'] - 
                                       transactions['membership_expire_date']).\
                                       dt.total_seconds() / (3600 * 24)
    
    # Determine which transactions are associated with a churn
    transactions['churn'] = transactions['difference_days'] > churn_days
    
    # Find date of each churn
    transactions.loc[transactions['churn'] == True, 
                     'churn_date'] = transactions.loc[transactions['churn'] == True, 
                                                      'membership_expire_date'] + pd.Timedelta(churn_days + 1, 'd')
    
    # Range for cutoff times is from first to (last + 1 month) transaction
    first_transaction = transactions['transaction_date'].min()
    last_transaction = transactions['transaction_date'].max()
    start_date = datetime.datetime(first_transaction.year, first_transaction.month, 1)
    
    # Handle December
    if last_transaction.month == 12:
        end_date = datetime.datetime(last_transaction.year + 1, 1, 1)
    else:
        end_date = datetime.datetime(last_transaction.year, last_transaction.month + 1, 1)
    
    # Make label times dataframe with cutoff times corresponding to prediction date
    label_times = pd.DataFrame({'cutoff_time': pd.date_range(start_date, end_date, freq = prediction_date),
                                'msno': customer_id
                               })
    
    # Use the lead time and prediction window parameters to establish the prediction window 
    # Prediction window is for each cutoff time
    label_times['prediction_window_start'] = label_times['cutoff_time'].shift(-lead_time)
    label_times['prediction_window_end'] = label_times['cutoff_time'].shift(-(lead_time + prediction_window))
    
    previous_churn_date = None

    # Iterate through every cutoff time
    for i, row in label_times.iterrows():
        
        # Default values if unknown
        churn_date = pd.NaT
        label = np.nan
        # Find the window start and end
        window_start = row['prediction_window_start']
        window_end = row['prediction_window_end']
        # Determine if there were any churns during the prediction window
        churns = transactions.loc[(transactions['churn_date'] >= window_start) & 
                                  (transactions['churn_date'] < window_end), 'churn_date']

        # Positive label if there was a churn during window
        if not churns.empty:
            label = 1
            churn_date = churns.values[0]

            # Find number of days until next churn by 
            # subsetting to cutoff times before current churn and after previous churns
            if not previous_churn_date:
                before_idx = label_times.loc[(label_times['cutoff_time'] <= churn_date)].index
            else:
                before_idx = label_times.loc[(label_times['cutoff_time'] <= churn_date) & 
                                             (label_times['cutoff_time'] > previous_churn_date)].index

            # Calculate days to next churn for cutoff times before current churn
            label_times.loc[before_idx, 'days_to_churn'] = (churn_date - label_times.loc[before_idx, 
                                                                                         'cutoff_time']).\
                                                            dt.total_seconds() / (3600 * 24)
            previous_churn_date = churn_date
        # No churns, but need to determine if an active member
        else:
            # Find transactions before the end of the window that were not cancelled
            transactions_before = transactions.loc[(transactions['transaction_date'] < window_end) & 
                                                   (transactions['is_cancel'] == False)].copy()
            # If the membership expiration date for this membership is after the window start, the customer has not churned
            if np.any(transactions_before['membership_expire_date'] >= window_start):
                label = 0

        # Assign values
        label_times.loc[i, 'label'] = label
        label_times.loc[i, 'churn_date'] = churn_date
        
        # Handle case with no churns
        if not np.any(label_times['label'] == 1):
            label_times['days_to_churn'] = np.nan
            label_times['churn_date'] = pd.NaT
        
    if return_trans:
        return label_times.drop(columns = ['msno']), transactions
    
    return label_times[['msno', 'cutoff_time', 'label', 'days_to_churn', 'churn_date']].copy()
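
The churn-detection step inside `label_customer` can be followed on a small example. The sketch below uses hypothetical transactions for a single customer and reproduces only the gap calculation and the churn date, not the cutoff-time labeling; it uses `.dt.days` and `.where` for brevity instead of the exact expressions in the function.

```python
import pandas as pd

churn_days = 31  # same threshold used later in this notebook

# Hypothetical transactions for one customer, already sorted chronologically
toy = pd.DataFrame({
    "transaction_date": pd.to_datetime(["2016-01-01", "2016-02-01", "2016-05-01"]),
    "membership_expire_date": pd.to_datetime(["2016-02-01", "2016-03-01", "2016-06-01"]),
})

# Next transaction date: shift the column up by one row
toy["next_transaction_date"] = toy["transaction_date"].shift(-1)

# Days between a membership expiring and the next transaction
toy["difference_days"] = (
    toy["next_transaction_date"] - toy["membership_expire_date"]
).dt.days

# A gap longer than churn_days marks a churn; the last row has no next
# transaction, so its gap is NaN and the comparison is False
toy["churn"] = toy["difference_days"] > churn_days

# The churn date is the first day on which the gap exceeds churn_days
toy["churn_date"] = (
    toy["membership_expire_date"] + pd.Timedelta(days=churn_days + 1)
).where(toy["churn"])

print(toy[["difference_days", "churn", "churn_date"]])
```

Only the second transaction is followed by a gap longer than 31 days, so it is the only row flagged as a churn.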

In [9]:
def make_label_times(transactions, prediction_date, churn_days, 
                   lead_time = 1, prediction_window = 1,):
    """
    Make labels for an entire series of transactions. 
    
    Params
    --------
        transactions (dataframe): table of customer transactions
        prediction_date (str): time at which predictions are made. Either "MS" for the first of the month
                               or "SMS" for the first and fifteenth of each month 
        churn_days (int): integer number of days without an active membership required for a churn. A churn is
                          defined by exceeding this number of days without an active membership.
        lead_time (int): number of periods in advance to make predictions for. Defaults to 1 (predictions for one offset)
        prediction_window(int): number of periods over which to consider churn. Defaults to 1.
    Return
    --------
        label_times (dataframe): a table with customer ids, cutoff times, binary label, regression label, 
                                 and date of churn. This table can then be used for feature engineering.
    """
    
    label_times = []
    transactions = transactions.sort_values(['msno', 'transaction_date'])
    
    # Iterate through each customer and find labels
    for customer_id, customer_transactions in transactions.groupby('msno'):
        lt_cust = label_customer(customer_id, customer_transactions,
                                                   prediction_date, churn_days, 
                                                   lead_time, prediction_window)
        
        label_times.append(lt_cust)
        
    # Concatenate into a single dataframe
    return pd.concat(label_times)

In [71]:
customer_id = df.msno[65904]
customer_id

'AVWhEK/wxBSPrAdTtP0SiGUX+lt2ozzv13GGvfxNlTw='

In [72]:
customer = df[df.msno == customer_id]
customer

| | msno | payment_method_id | payment_plan_days | plan_list_price | actual_amount_paid | is_auto_renew | transaction_date | membership_expire_date | is_cancel |
|---|---|---|---|---|---|---|---|---|---|
| 65904 | AVWhEK/wxBSPrAdTtP0SiGUX+lt2ozzv13GGvfxNlTw= | 38 | 410 | 1788 | 1788 | 0 | 2016-02-29 | 2017-04-14 | 0 |


In [73]:
customer_labels, customer_transactions = label_customer(customer_id, customer, prediction_date = 'MS', churn_days = 31, return_trans=True)

In [74]:
customer_labels

| | cutoff_time | prediction_window_start | prediction_window_end | label | churn_date | days_to_churn |
|---|---|---|---|---|---|---|
| 0 | 2016-02-01 | 2016-03-01 | NaT | NaN | NaT | NaN |
| 1 | 2016-03-01 | NaT | NaT | NaN | NaT | NaN |
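
Both labels for this customer are missing because the prediction window is built by shifting the cutoff times, and a customer with a single transaction has only two cutoff times to shift. A minimal sketch of that step in isolation, using the dates from this customer; the `labelable` name is illustrative and not part of the original function:

```python
import pandas as pd

lead_time, prediction_window = 1, 1

# One transaction on 2016-02-29 gives monthly cutoff times
# from 2016-02-01 through 2016-03-01
cutoffs = pd.DataFrame(
    {"cutoff_time": pd.date_range("2016-02-01", "2016-03-01", freq="MS")}
)
cutoffs["prediction_window_start"] = cutoffs["cutoff_time"].shift(-lead_time)
cutoffs["prediction_window_end"] = cutoffs["cutoff_time"].shift(
    -(lead_time + prediction_window)
)

# Only rows with both window bounds present can receive a label
labelable = cutoffs.dropna(
    subset=["prediction_window_start", "prediction_window_end"]
)

print(cutoffs)
print(len(labelable))
```

Comparisons against `NaT` evaluate to False, so with a missing window bound neither the churn branch nor the active-member branch of `label_customer` assigns a value, and the label stays NaN.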


In [75]:
customer_transactions

| | msno | payment_method_id | payment_plan_days | plan_list_price | actual_amount_paid | is_auto_renew | transaction_date | membership_expire_date | is_cancel | next_transaction_date | difference_days | churn | churn_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 65904 | AVWhEK/wxBSPrAdTtP0SiGUX+lt2ozzv13GGvfxNlTw= | 38 | 410 | 1788 | 1788 | 0 | 2016-02-29 | 2017-04-14 | 0 | NaT | NaN | False | NaT |


#### Run Operation

This nested operation is run against the transactions dataframe, and outputs a new dataframe which will be saved to file.
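
Because the operation loops over every customer in Python, it is slow on the full dataset. As a hedged illustration only, the churn-flagging part (not the cutoff-time generation) could be expressed without an explicit loop by shifting within each customer group. The sketch below runs on hypothetical data and is not the FeatureTools implementation.

```python
import pandas as pd

churn_days = 31

# Hypothetical transactions for two customers
toy = pd.DataFrame({
    "msno": ["a", "a", "b", "b"],
    "transaction_date": pd.to_datetime(
        ["2016-01-01", "2016-04-01", "2016-01-01", "2016-02-01"]
    ),
    "membership_expire_date": pd.to_datetime(
        ["2016-02-01", "2016-05-01", "2016-02-01", "2016-03-01"]
    ),
}).sort_values(["msno", "transaction_date"])

# Shift within each customer so the last transaction of one customer
# is never paired with the first transaction of the next
toy["next_transaction_date"] = toy.groupby("msno")["transaction_date"].shift(-1)

gap_days = (toy["next_transaction_date"] - toy["membership_expire_date"]).dt.days
toy["churn"] = gap_days > churn_days

print(toy[["msno", "next_transaction_date", "churn"]])
```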

In [10]:
label_times = make_label_times(df, prediction_date = 'MS', churn_days = 31,
                               lead_time = 1, prediction_window = 1)

*an ode while we wait...*


*loops resemble molasses;*



*when run on big data!*

In [11]:
label_times.head()

| | msno | cutoff_time | label | days_to_churn | churn_date |
|---|---|---|---|---|---|
| 0 | +++IZseRRiQS9aaSkH6cMYU6bGDcxUieAi/tH67sC5s= | 2016-10-01 | NaN | NaN | NaT |
| 1 | +++IZseRRiQS9aaSkH6cMYU6bGDcxUieAi/tH67sC5s= | 2016-11-01 | NaN | NaN | NaT |
| 0 | +++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o= | 2017-03-01 | NaN | NaN | NaT |
| 1 | +++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o= | 2017-04-01 | NaN | NaN | NaT |
| 0 | +++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw= | 2017-02-01 | 0.0 | NaN | NaT |


In [12]:
label_times.tail()

| | msno | cutoff_time | label | days_to_churn | churn_date |
|---|---|---|---|---|---|
| 0 | zzz1Dc3P9s53HAowRTrm3fNsWju5yeN4YBfNDq7Z99Q= | 2017-02-01 | 0.0 | NaN | NaT |
| 1 | zzz1Dc3P9s53HAowRTrm3fNsWju5yeN4YBfNDq7Z99Q= | 2017-03-01 | NaN | NaN | NaT |
| 2 | zzz1Dc3P9s53HAowRTrm3fNsWju5yeN4YBfNDq7Z99Q= | 2017-04-01 | NaN | NaN | NaT |
| 0 | zzzF1KsGfHH3qI6qiSNSXC35UXmVKMVFdxkp7xmDMc0= | 2017-03-01 | NaN | NaN | NaT |
| 1 | zzzF1KsGfHH3qI6qiSNSXC35UXmVKMVFdxkp7xmDMc0= | 2017-04-01 | NaN | NaN | NaT |


In [13]:
label_times.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2627879 entries, 0 to 1
Data columns (total 5 columns):
 #   Column         Dtype         
---  ------         -----         
 0   msno           object        
 1   cutoff_time    datetime64[ns]
 2   label          float64       
 3   days_to_churn  float64       
 4   churn_date     datetime64[ns]
dtypes: datetime64[ns](2), float64(2), object(1)
memory usage: 120.3+ MB


In [14]:
label_times = label_times.reset_index()
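
`pd.concat` keeps each customer's own 0-based index, which is why the index repeats before this reset. Calling `reset_index()` without arguments keeps the old index as a column named `index`; passing `drop=True` discards it. A small sketch of the difference on hypothetical frames:

```python
import pandas as pd

# Two hypothetical per-customer frames, each with its own 0-based index
parts = [
    pd.DataFrame({"msno": ["a", "a"], "label": [0.0, float("nan")]}),
    pd.DataFrame({"msno": ["b"], "label": [0.0]}),
]

combined = pd.concat(parts)                 # index repeats: 0, 1, 0
kept = combined.reset_index()               # old index kept as an 'index' column
dropped = combined.reset_index(drop=True)   # old index discarded

print(kept.columns.tolist())
print(dropped.columns.tolist())
```

Whether to keep the column depends on whether the per-customer row position is needed downstream. `pd.concat(parts, ignore_index=True)` produces the same result as the `drop=True` variant in one step.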

In [15]:
label_times.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2627879 entries, 0 to 2627878
Data columns (total 6 columns):
 #   Column         Dtype         
---  ------         -----         
 0   index          int64         
 1   msno           object        
 2   cutoff_time    datetime64[ns]
 3   label          float64       
 4   days_to_churn  float64       
 5   churn_date     datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(1), object(1)
memory usage: 120.3+ MB


In [24]:
label_times.label.sum()

0.0
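
A sum of 0.0 shows only that no row carries a positive label. It does not separate rows labeled 0 from rows with a missing label, because `Series.sum()` skips NaN by default. A minimal sketch on hypothetical values of how the full distribution, missing labels included, could be inspected:

```python
import pandas as pd

# Hypothetical label column: a mix of 0.0 and missing values
labels = pd.Series(
    [0.0, float("nan"), float("nan"), 0.0, float("nan")], name="label"
)

total = labels.sum()                         # NaN entries are skipped
n_missing = labels.isna().sum()              # number of unlabeled rows
counts = labels.value_counts(dropna=False)   # counts per value, NaN included

print(total, n_missing)
print(counts)
```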