### Background
With smartphone users numbering in the hundreds of millions, O2O (Online to Offline) commerce requires businesses to maintain a strong presence both online and offline. Apps with O2O capabilities accumulate consumer behaviour and location data every day, and putting that data to use calls for big-data analysis and commercial operations management. This competition focuses on coupon redemption rates. Coupons are a common O2O marketing tool for re-engaging existing customers and attracting new ones. Customers welcome coupons they want but are frustrated by coupons they do not need. For merchants, sending unwanted coupons may erode brand equity and make marketing spend harder to forecast. Targeted marketing is an important technique for raising the redemption rate: it gives customers relevant discounts and gives businesses an effective marketing tool. The competition supplies abundant O2O data and asks contestants to predict whether a customer will use a coupon within a specified time frame.
### Data
This competition provides real online and offline user consumption data from January 1, 2016 to June 15, 2016. The contestants are expected to predict the probability of customers redeeming a coupon within 15 days of receiving it.
Note: To protect the privacy of users and merchants, the data has been desensitized and sampled with bias.
### Evaluation
Results are evaluated by average AUC: the AUC is calculated separately for each coupon_id, and the mean of those per-coupon AUC values is the evaluation score. See Wikipedia for more on how AUC is calculated.
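
The metric described above can be sketched in plain Python. This is only an illustration on made-up toy data, not the official scoring script: the helper names `auc` and `mean_coupon_auc` are hypothetical, and skipping coupons whose labels are all 0 or all 1 (where AUC is undefined) is an assumption about how such coupons are handled.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None  # AUC is undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_coupon_auc(coupon_ids, labels, scores):
    """Average the per-coupon AUC, skipping coupons where AUC is undefined (an assumption)."""
    groups = defaultdict(lambda: ([], []))
    for c, l, s in zip(coupon_ids, labels, scores):
        groups[c][0].append(l)
        groups[c][1].append(s)
    aucs = [a for a in (auc(l, s) for l, s in groups.values()) if a is not None]
    return sum(aucs) / len(aucs) if aucs else None


# Toy data: coupon 'a' is ranked perfectly, coupon 'b' is ranked backwards.
coupons = ['a', 'a', 'a', 'b', 'b']
labels = [1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.1, 0.3, 0.6]
score = mean_coupon_auc(coupons, labels, scores)
print(score)
```

Because each coupon contributes equally to the mean, a coupon with a handful of rows weighs as much as one with thousands.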


In [1]:
import time
now = time.time()

In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import re
from datetime import datetime
from ast import literal_eval

import warnings
warnings.filterwarnings("ignore")

### Online and Offline Training data

In [3]:
df_on = pd.read_csv('DataSets/ccf_online_stage1_train.csv')
df_off = pd.read_csv('DataSets/ccf_offline_stage1_train.csv')

In [4]:
print("Online Training Data Sample\nShape:"+str(df_on.shape))
df_on.head()

Online Training Data Sample
Shape:(11429826, 7)


Unnamed: 0,User_id,Merchant_id,Action,Coupon_id,Discount_rate,Date_received,Date
0,13740231,18907,2,100017492.0,500:50,20160513.0,
1,13740231,34805,1,,,,20160321.0
2,14336199,18907,0,,,,20160618.0
3,14336199,18907,0,,,,20160618.0
4,14336199,18907,0,,,,20160618.0


In [5]:
print("Offline Training Data Sample\nShape:"+str(df_off.shape))
df_off.head()

Offline Training Data Sample
Shape:(1754884, 7)


Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date
0,1439408,2632,,,0.0,,20160217.0
1,1439408,4663,11002.0,150:20,1.0,20160528.0,
2,1439408,2632,8591.0,20:1,0.0,20160217.0,
3,1439408,2632,1078.0,20:1,0.0,20160319.0,
4,1439408,2632,8591.0,20:1,0.0,20160613.0,


### Test Data (Offline)

In [6]:
df_test = pd.read_csv('DataSets/ccf_offline_stage1_test_revised.csv')
print("Testing Data(Offline) Sample\nShape:"+str(df_test.shape))
df_test.head()

Testing Data(Offline) Sample
Shape:(113640, 6)


Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received
0,4129537,450,9983,30:5,1.0,20160712
1,6949378,1300,3429,30:5,,20160706
2,2166529,7113,6928,200:20,5.0,20160727
3,2166529,7113,1808,100:10,5.0,20160727
4,6172162,7605,6500,30:1,2.0,20160708


#### Converting Date to DateTime format

In [7]:
#Online Training Data
df_on['Date'] = pd.to_datetime(df_on["Date"],format='%Y%m%d')
df_on['Date_received'] = pd.to_datetime(df_on["Date_received"],format='%Y%m%d')

#Offline Training Data
df_off['Date'] = pd.to_datetime(df_off["Date"],format='%Y%m%d')
df_off['Date_received'] = pd.to_datetime(df_off["Date_received"],format='%Y%m%d')

#Testing Data
df_test['Date_received'] = pd.to_datetime(df_test["Date_received"],format='%Y%m%d')

### Removing Duplicates from Online and Offline Training Data

In [8]:
#Removing duplicates and giving frequency counts(Count) to each row

#Online
x = 'g8h.|$hTdo+jC9^@'   #garbage value for nan values 
df_on_unique = (df_on.fillna(x).groupby(['User_id', 'Merchant_id', 'Action', 'Coupon_id', 'Discount_rate',
       'Date_received', 'Date']).size().reset_index()
               .rename(columns={0 : 'Count'}).replace(x,np.NaN))
df_on_unique["Date_received"]=pd.to_datetime(df_on_unique["Date_received"])
df_on_unique["Date"]=pd.to_datetime(df_on_unique["Date"])

print("Online Training Data Shape:"+str(df_on_unique.shape))

Online Training Data Shape:(5822543, 8)


In [9]:
#Offline
x = 'g8h.|$hTdo+jC9^@'   #garbage value for nan values 
df_off_unique = (df_off.fillna(x).groupby(['User_id', 'Merchant_id', 'Coupon_id', 'Discount_rate', 'Distance',
       'Date_received', 'Date']).size().reset_index()
               .rename(columns={0 : 'Count'}).replace(x,np.NaN))
df_off_unique["Date_received"]=pd.to_datetime(df_off_unique["Date_received"])
df_off_unique["Date"]=pd.to_datetime(df_off_unique["Date"])

print("Offline Training Data Shape:"+str(df_off_unique.shape))

Offline Training Data Shape:(1716991, 8)
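
The placeholder string in the two cells above is needed because pandas `groupby` drops rows whose key contains NaN by default, so rows without a coupon or without a purchase date would otherwise vanish from the counts. The same idea is sketched below in plain Python, with `None` standing in for NaN; the rows are hypothetical and only loosely modelled on the samples shown earlier.

```python
from collections import Counter

# Hypothetical rows: (User_id, Merchant_id, Coupon_id, Date_received, Date); None marks a missing value.
rows = [
    (14336199, 18907, None, None, '20160618'),
    (14336199, 18907, None, None, '20160618'),
    (14336199, 18907, None, None, '20160618'),
    (13740231, 18907, 100017492, '20160513', None),
]

# Counting whole tuples keeps rows with missing fields, which is what the
# placeholder string achieves for groupby in the cells above.
counts = Counter(rows)
unique_rows = [row + (count,) for row, count in counts.items()]

for row in unique_rows:
    print(row)
```

Recent pandas versions also accept `dropna=False` in `groupby`, which keeps NaN keys without a placeholder.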


## Joining Train and test Data

In [10]:
#DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
df_data = pd.concat([df_off_unique, df_test], sort=False)

#### Filling Nan for Distance (OFFLINE)

In [11]:
df_data['Distance'] = df_data['Distance'].fillna(df_data['Distance'].mean())
df_data['Distance'] = df_data.Distance.astype(int)

#### Adding Date Number

In [12]:
df_data['DateTrack'] = df_data['Date'].copy()
df_data['DateTrack'] = df_data['DateTrack'].fillna(df_data['Date_received'])
df_data['First_day'] = pd.to_datetime('20160101',format='%Y%m%d')
df_data['DayNum'] = df_data['DateTrack'] - df_data['First_day'] 
df_data['DayNum'] = df_data['DayNum'].dt.days.astype('str')
df_data['DayNum'] = pd.to_numeric(df_data['DayNum'],errors="coerce") + 1
df_data =df_data.drop('First_day',axis=1)
df_data.head()

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,Count,DateTrack,DayNum
0,4,1433,8735.0,30:5,10,2016-02-14,NaT,1.0,2016-02-14,45
1,4,1469,2902.0,0.95,10,2016-06-07,NaT,1.0,2016-06-07,159
2,35,3381,1807.0,300:30,0,2016-01-30,NaT,1.0,2016-01-30,30
3,35,3381,9776.0,10:5,0,2016-01-29,NaT,1.0,2016-01-29,29
4,35,3381,11951.0,200:20,0,2016-01-29,NaT,1.0,2016-01-29,29
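
As a quick check of the numbering, `DayNum` is a 1-based day index counted from 2016-01-01. A minimal sketch with the standard library (the helper name `day_number` is hypothetical):

```python
from datetime import date

FIRST_DAY = date(2016, 1, 1)


def day_number(d):
    """1-based day index counted from 2016-01-01."""
    return (d - FIRST_DAY).days + 1


print(day_number(date(2016, 2, 14)))  # 45, matching the first row above
print(day_number(date(2016, 6, 7)))   # 159
print(day_number(date(2016, 6, 30)))  # 182
```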


#### Converting Discount Ratio to Rate

In [13]:
#Function to convert discount ratio to discount rate
def convert_discount(discount):
    values = []
    for i in discount:
        if ':' in i:
            i = i.split(':')
            rate = round((int(i[0]) - int(i[1]))/int(i[0]),3)
            values.append([int(i[0]),int(i[1]),rate])
        elif '.' in i:
            i = float(i)
            x = 100*i
            values.append([100,int(100-x),i])
            
    discounts = dict(zip(discount,values))      
    return discounts
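
To make the conversion concrete, here is a single-value sketch of the same arithmetic. The name `parse_discount` and the labels `threshold` and `reduction` are hypothetical, chosen only for this illustration. One deliberate difference: it uses `round` where the function above truncates with `int`, which guards against floating-point results that land just below a whole number.

```python
def parse_discount(text):
    """Return (threshold, reduction, rate) for one Discount_rate string."""
    if ':' in text:
        threshold, reduction = (int(part) for part in text.split(':'))
        rate = round((threshold - reduction) / threshold, 3)
        return threshold, reduction, rate
    # A plain decimal such as '0.95' is already a rate; express it on a base of 100.
    rate = float(text)
    return 100, round(100 - 100 * rate), rate


print(parse_discount('30:5'))
print(parse_discount('0.95'))
```

For `'30:5'` the rate is (30 - 5) / 30, rounded to three decimals, so a rate close to 1 means a small discount.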

In [14]:
#OFFLINE DATA
df_data = df_data[(df_data['Coupon_id'].isna()==False)].copy()
discounts_offline = list(df_data['Discount_rate'].unique())
df_data.loc[:,('Discount')] = df_data.loc[:,('Discount_rate')] 
df_data['Discount_rate'] = df_data['Discount'].map(convert_discount(discounts_offline))
df_data[['Original_price','Discounted_price','Rate']] = pd.DataFrame(df_data.Discount_rate.values.tolist(), index= df_data.index)
df_data.head()

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,Count,DateTrack,DayNum,Discount,Original_price,Discounted_price,Rate
0,4,1433,8735.0,"[30, 5, 0.833]",10,2016-02-14,NaT,1.0,2016-02-14,45,30:5,30,5,0.833
1,4,1469,2902.0,"[100, 5, 0.95]",10,2016-06-07,NaT,1.0,2016-06-07,159,0.95,100,5,0.95
2,35,3381,1807.0,"[300, 30, 0.9]",0,2016-01-30,NaT,1.0,2016-01-30,30,300:30,300,30,0.9
3,35,3381,9776.0,"[10, 5, 0.5]",0,2016-01-29,NaT,1.0,2016-01-29,29,10:5,10,5,0.5
4,35,3381,11951.0,"[200, 20, 0.9]",0,2016-01-29,NaT,1.0,2016-01-29,29,200:20,200,20,0.9


# FEATURES
#### User, Merchant, Coupon, Date Received, Rates

In [15]:
users_level = pd.read_csv('DataSets/DatasetsCreated/user_level.csv')
users_level.head()

Unnamed: 0,User_id,Tag,User_Released,User_Redeemed,User_Ratio,User_Buys,Purchaser,UserMerchantCount,DayList,UserReleaseList,UserRedeemList,User_Redeemed_Buy
0,4,0,3.0,0.0,0.0,1,0,4,[68],"[45, 91, 159]",[],0.0
1,35,1,4.0,0.0,0.0,0,0,1,[],"[29, 29, 30, 30]",[],0.0
2,36,0,2.0,0.0,0.0,1,0,3,[20],"[25, 25]",[],0.0
3,64,0,1.0,0.0,0.0,3,0,3,"[147, 158, 158]",[29],[],0.0
4,110,1,3.0,0.0,0.0,0,0,3,[],"[31, 31, 31]",[],0.0


In [16]:
merchants_level = pd.read_csv('DataSets/DatasetsCreated/merchant_level.csv')
merchants_level.head()

Unnamed: 0,Merchant_id,Merchant_Redeemed,Merchant_Ratio,Merchant_AvgDistance,Merchant_Popular,Merchant_AvgRate,AvgDailyUsers,MerchantBuyList,MerchantReleaseList,MerchantRedeemList,UniqueUsersCount,Merchant_Buys,Merchant_Redeemed_Buy
0,1433,726.0,0.04,3.927992,1,0.810455,49.697802,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...","[2, 13, 19, 20, 24, 25, 26, 27, 28, 29, 30, 31...","[24, 25, 27, 29, 30, 31, 33, 34, 35, 36, 37, 3...",19340,9045,0.080265
1,1469,675.0,0.05,2.617818,1,0.707819,74.527473,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...","[4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...","[9, 10, 14, 15, 20, 21, 30, 31, 61, 64, 65, 66...",13702,13576,0.04972
2,3381,2473.0,0.02,2.690394,1,0.866463,119.587912,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...","[17, 19, 23, 24, 25, 26, 27, 28, 29, 30, 31, 3...","[23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 3...",108018,21829,0.11329
3,1041,402.0,0.05,2.846591,1,0.831778,19.510989,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...","[6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1...","[7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, ...",7966,3553,0.113144
4,5717,293.0,0.02,2.255078,1,0.751612,21.571429,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...","[17, 19, 23, 24, 25, 26, 27, 28, 29, 30, 31, 3...","[24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 3...",12506,3927,0.074612


In [17]:
coupons_level = pd.read_csv('DataSets/DatasetsCreated/coupon_level.csv')
coupons_level.head()

Unnamed: 0,Coupon_id,Coupon_Released,Coupon_Redeemed,Coupon_Ratio,Duration,CouponRedeemList,CouponReleaseList,FirstReleaseDate
0,1,5,1,0.2,24,[154],"[134, 143, 151, 155, 158]",134
1,10,32,15,0.47,18,"[136, 138, 139, 140, 141, 142, 143, 145, 146, ...","[134, 135, 136, 137, 138, 139, 140, 141, 142, ...",134
2,100,7,1,0.14,22,[142],"[135, 142, 143, 147, 148, 150, 164]",135
3,1000,38,4,0.11,29,"[31, 32, 35, 37]","[17, 19, 28, 29, 32, 35, 46]",17
4,10000,17,11,0.65,47,"[8, 12, 15, 16, 17, 25, 26, 27, 28, 31, 49]","[2, 5, 7, 8, 11, 12, 15, 16, 17, 21, 24, 25, 2...",2


In [18]:
rates_level = pd.read_csv('DataSets/DatasetsCreated/rate_level.csv')
rates_level.head()

Unnamed: 0,Rate,Rate_Releases,Rate_Redeemed,Rate_Ratio
0,0.2,81,6,0.074074
1,0.333,45003,3920,0.087105
2,0.375,4,4,1.0
3,0.4,11395,2529,0.221939
4,0.5,142664,18480,0.129535


In [19]:
date_level = pd.read_csv('DataSets/DatasetsCreated/date_level.csv')
date_level['Date_received'] =  date_level['Date_received'].astype('datetime64[ns]') 
date_level.head()

Unnamed: 0,Date_received,ReleasesCount,ImpDay,Weekend,DayOfWeek,UniqueReleasesCount
0,2016-01-01,553,0,0,4,136
1,2016-01-02,541,0,1,5,120
2,2016-01-03,529,0,1,6,129
3,2016-01-04,574,0,0,0,130
4,2016-01-05,677,0,0,1,156


## Lists to capture previous interactions

### Merchant-User Buying List

In [20]:
merchant_user_dates = df_off_unique[df_off_unique['Date'].isna()==False]
merchant_user_dates['First_day'] = pd.to_datetime('20160101',format='%Y%m%d')
merchant_user_dates['DayNum'] = merchant_user_dates['Date'] - merchant_user_dates['First_day'] 
merchant_user_dates['DayNum'] = merchant_user_dates['DayNum'].dt.days.astype('str')
merchant_user_dates['DayNum'] = pd.to_numeric(merchant_user_dates['DayNum'],errors="coerce") + 1
merchant_user_dates.head()

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,Count,First_day,DayNum
15,165,2934,,,0.0,NaT,2016-01-11,1,2016-01-01,11
16,165,2934,,,0.0,NaT,2016-01-25,1,2016-01-01,25
17,165,2934,,,0.0,NaT,2016-03-21,1,2016-01-01,81
18,165,2934,,,0.0,NaT,2016-03-28,1,2016-01-01,88
19,165,2934,,,0.0,NaT,2016-04-14,1,2016-01-01,105


In [21]:
merchant_user_days = pd.DataFrame(merchant_user_dates.groupby(['User_id','Merchant_id'])['DayNum']
                                  .apply(list).reset_index(name='Merchant_User_Visit'))


In [22]:
merchant_user_days['Merchant_User_Visit'] = merchant_user_days['Merchant_User_Visit'].apply(lambda x : sorted(set(x)))
merchant_user_days.head()

Unnamed: 0,User_id,Merchant_id,Merchant_User_Visit
0,165,2934,"[11, 25, 81, 88, 105, 131, 154, 169]"
1,165,4195,"[97, 103, 111, 116, 139, 146]"
2,184,3381,[59]
3,209,3267,[157]
4,215,129,[63]


### Merchant User Releases List

In [23]:
merchant_user_releases = df_data[df_data['Date_received'].isna()==False]
merchant_user_releases['First_day'] = pd.to_datetime('20160101',format='%Y%m%d')
merchant_user_releases['DayNum'] = merchant_user_releases['Date_received'] - merchant_user_releases['First_day'] 
merchant_user_releases['DayNum'] = merchant_user_releases['DayNum'].dt.days.astype('str')
merchant_user_releases['DayNum'] = pd.to_numeric(merchant_user_releases['DayNum'],errors="coerce") + 1
merchant_user_releases.head()

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,Count,DateTrack,DayNum,Discount,Original_price,Discounted_price,Rate,First_day
0,4,1433,8735.0,"[30, 5, 0.833]",10,2016-02-14,NaT,1.0,2016-02-14,45,30:5,30,5,0.833,2016-01-01
1,4,1469,2902.0,"[100, 5, 0.95]",10,2016-06-07,NaT,1.0,2016-06-07,159,0.95,100,5,0.95,2016-01-01
2,35,3381,1807.0,"[300, 30, 0.9]",0,2016-01-30,NaT,1.0,2016-01-30,30,300:30,300,30,0.9,2016-01-01
3,35,3381,9776.0,"[10, 5, 0.5]",0,2016-01-29,NaT,1.0,2016-01-29,29,10:5,10,5,0.5,2016-01-01
4,35,3381,11951.0,"[200, 20, 0.9]",0,2016-01-29,NaT,1.0,2016-01-29,29,200:20,200,20,0.9,2016-01-01


In [24]:
merchant_user_releases = pd.DataFrame(merchant_user_releases.groupby(['User_id','Merchant_id'])['DayNum']
                                  .apply(list).reset_index(name='MerchantUserReleaseList'))
merchant_user_releases['MerchantUserReleaseList'] = merchant_user_releases['MerchantUserReleaseList'].apply(lambda x : sorted(x))
merchant_user_releases.head()

Unnamed: 0,User_id,Merchant_id,MerchantUserReleaseList
0,4,1433,[45]
1,4,1469,[159]
2,35,3381,"[29, 29, 30, 30]"
3,36,1041,[25]
4,36,5717,[25]


### Merchant User Redeem List

In [25]:
merchant_user_redeem = df_off_unique[(df_off_unique['Date_received'].isna()==False) 
                                     & (df_off_unique['Date'].isna()==False)]
merchant_user_redeem['First_day'] = pd.to_datetime('20160101',format='%Y%m%d')
merchant_user_redeem['DayNum'] = merchant_user_redeem['Date'] - merchant_user_redeem['First_day'] 
merchant_user_redeem['DayNum'] = merchant_user_redeem['DayNum'].dt.days.astype('str')
merchant_user_redeem['DayNum'] = pd.to_numeric(merchant_user_redeem['DayNum'],errors="coerce") + 1
merchant_user_redeem.head()

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,Count,First_day,DayNum
33,184,3381,9776.0,10:5,0.0,2016-01-29,2016-02-28,1,2016-01-01,59
76,417,775,5435.0,30:5,0.0,2016-03-29,2016-04-12,1,2016-01-01,103
150,687,6454,14031.0,100:10,,2016-01-28,2016-01-30,1,2016-01-01,30
153,687,8594,9353.0,30:1,,2016-03-28,2016-04-02,1,2016-01-01,93
158,696,4195,3726.0,0.9,0.0,2016-04-11,2016-04-13,1,2016-01-01,104


In [26]:
merchant_user_redeem = pd.DataFrame(merchant_user_redeem.groupby(['User_id','Merchant_id'])['DayNum']
                                  .apply(list).reset_index(name='MerchantUserRedeemList'))
merchant_user_redeem['MerchantUserRedeemList'] = merchant_user_redeem['MerchantUserRedeemList'].apply(lambda x : sorted(x))
merchant_user_redeem.head()

Unnamed: 0,User_id,Merchant_id,MerchantUserRedeemList
0,184,3381,[59]
1,417,775,[103]
2,687,6454,[30]
3,687,8594,[93]
4,696,4195,"[104, 144, 145, 151]"


### User Coupon Release List

In [27]:
coupon_user_dates = df_data.copy()
coupon_user_dates['First_day'] = pd.to_datetime('20160101',format='%Y%m%d')
coupon_user_dates['DayNum'] = coupon_user_dates['Date_received'] - coupon_user_dates['First_day'] 
coupon_user_dates['DayNum'] = coupon_user_dates['DayNum'].dt.days.astype('str')
coupon_user_dates['DayNum'] = pd.to_numeric(coupon_user_dates['DayNum'],errors="coerce") + 1
coupon_user_dates.head()

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,Count,DateTrack,DayNum,Discount,Original_price,Discounted_price,Rate,First_day
0,4,1433,8735.0,"[30, 5, 0.833]",10,2016-02-14,NaT,1.0,2016-02-14,45,30:5,30,5,0.833,2016-01-01
1,4,1469,2902.0,"[100, 5, 0.95]",10,2016-06-07,NaT,1.0,2016-06-07,159,0.95,100,5,0.95,2016-01-01
2,35,3381,1807.0,"[300, 30, 0.9]",0,2016-01-30,NaT,1.0,2016-01-30,30,300:30,300,30,0.9,2016-01-01
3,35,3381,9776.0,"[10, 5, 0.5]",0,2016-01-29,NaT,1.0,2016-01-29,29,10:5,10,5,0.5,2016-01-01
4,35,3381,11951.0,"[200, 20, 0.9]",0,2016-01-29,NaT,1.0,2016-01-29,29,200:20,200,20,0.9,2016-01-01


In [28]:
coupon_user_days = pd.DataFrame(coupon_user_dates.groupby(['User_id','Coupon_id'])['DayNum'].apply(list).reset_index(name='Coupon_User_Visit'))


In [29]:
coupon_user_days['Coupon_User_Visit'] = coupon_user_days['Coupon_User_Visit'].apply(lambda x : sorted(set(x)))
coupon_user_days.head()

Unnamed: 0,User_id,Coupon_id,Coupon_User_Visit
0,4,2902.0,[159]
1,4,8735.0,[45]
2,35,1807.0,[30]
3,35,9776.0,[29]
4,35,11951.0,"[29, 30]"


# Adding Features to DataSet

In [30]:
#Adding user level features 
data = df_data.merge(users_level,how='left',on='User_id')
print(data.shape[0])

#Adding merchant level features 
data = pd.merge(data, merchants_level, how='left', on='Merchant_id')
print(data.shape[0])

#Adding coupon level features 
data = pd.merge(data, coupons_level, how='left', on='Coupon_id')
print(data.shape[0])

#Adding date received level features 
data = pd.merge(data, date_level, how='left', on='Date_received')
print(data.shape[0])

# Adding merchant user buy list
data = pd.merge(data,merchant_user_days,how='left', on=['User_id','Merchant_id'])
print(data.shape[0])
    
# Adding coupon user release list
data = pd.merge(data,coupon_user_days,how='left', on=['User_id','Coupon_id'])
print(data.shape[0])

    
# Adding merchant user release list
data = pd.merge(data,merchant_user_releases,how='left',  on=['User_id','Merchant_id'])
print(data.shape[0])
    
# Adding merchant user redeem list
data = pd.merge(data,merchant_user_redeem,how='left',  on=['User_id','Merchant_id'])
print(data.shape[0])

    
for i in ['Coupon_User_Visit','Merchant_User_Visit','MerchantUserRedeemList','MerchantUserReleaseList']:
    for row in data.loc[data[i].isnull(), i].index:
        data.at[row, i] = []

for i in ['UserReleaseList','UserRedeemList','DayList','MerchantReleaseList','MerchantRedeemList',
          'MerchantBuyList','CouponReleaseList','CouponRedeemList']:
    for row in data.loc[data[i].isnull(), i].index:
        data.at[row, i] = '[]'



1129029
1129029
1129029
1129029
1129029
1129029
1129029
1129029


In [31]:
# data.columns

### Function to find the last visit or interaction

In [32]:
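def lastVisit(days, x, check):
    """Days between day x and the closest earlier entry in the sorted list days (NaN if none).

    When check is True, days arrives as a string such as '[1, 2]' and is parsed first.
    """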
    if check:
        days = literal_eval(days)
    n = len(days)
    if n==0:
        return np.nan
    x = int(x)
    if x in days:
        try:
            i = days.index(x)
            if i == 0:
                return np.nan
            return days[i]-days[i-1] 
        except IndexError:
            return np.nan
    if x > days[n-1]:
        return x - days[n-1]
    elif x < days[0]:
        return np.nan
    else:
        for i in range(n):
            if (days[i]>x) & (i>=1):
                return x - days[i-1]
    return np.nan
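
For sorted lists, the same lookup can be sketched with a binary search. This is an alternative illustration rather than the notebook's implementation: `days_since_previous` is a hypothetical name, and it returns `None` where `lastVisit` returns NaN.

```python
from bisect import bisect_left


def days_since_previous(days, x):
    """Gap between day x and the closest strictly earlier entry of sorted `days`, else None."""
    i = bisect_left(days, x)  # number of entries strictly before x
    if i == 0:
        return None  # nothing earlier than x (also covers an empty list)
    return x - days[i - 1]


visits = [11, 25, 81, 88, 105]
print(days_since_previous(visits, 30))  # previous visit on day 25
print(days_since_previous(visits, 25))  # previous visit on day 11
print(days_since_previous(visits, 11))  # no earlier visit
```

The binary search avoids scanning the whole list, which may matter for merchants with long buy lists.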

### Function to calculate the total visits or interactions

In [33]:
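def lastCount(days, x, check):
    """Count interactions in the sorted list days relative to day x.

    Returns 0 for an empty list or when x precedes every entry, len(days) when x
    follows every entry, and otherwise the 1-based position of x (or of the first
    later entry when x itself is not in the list).
    """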
    if check:
        days = literal_eval(days)
    n = len(days)
    if n==0:
        return 0
    x = int(x)
    if x in days:
        try:
            i = days.index(x)
            return i+1 
        except IndexError:
            return 0
    if x > days[n-1]:
        return n
    elif x < days[0]:
        return 0
    else:
        for d in range(n):
            if (days[d]>x):
                return d+1
    return 0

# Time Based Features

## Time based user level features

### User and Number of coupons given to the user

In [34]:
data['UserReleasesTime'] = [lastCount(x,y, True) for (x,y) in 
                                   zip(data['UserReleaseList'],data['DayNum'])]

### User and Number of coupons redeemed by the user

In [35]:
data['UserRedeemTime'] = [lastCount(x,y, True) for (x,y) in 
                                   zip(data['UserRedeemList'],data['DayNum'])]

### User and total purchases made by the user

In [36]:
data['UserBuysTime'] = [lastCount(x,y, True) for (x,y) in 
                                   zip(data['DayList'],data['DayNum'])]

### User Redeem/Buy ratio

In [37]:
data['UserRedeemedBuyRatioTime'] = [x/y if y!=0 else 0 for x,y 
                                        in zip(data['UserRedeemTime'],data['UserBuysTime'])]

### User Redeem/Release ratio

In [38]:
data['UserRatioTime'] = [x/y if y!=0 else 0 for x,y 
                                        in zip(data['UserRedeemTime'],data['UserReleasesTime'])]

### User and its last visit

In [39]:
data['LastUserVisit'] = [lastVisit(x,y, True) for (x,y) in zip(data['DayList'],data['DayNum'])]

## Time based Merchant level features

### Merchant and Number of coupons released by the merchant

In [40]:
data['MerchantReleasesTime'] = [lastCount(x,y, True) for (x,y) in 
                                   zip(data['MerchantReleaseList'],data['DayNum'])]

### Merchant and Number of coupons redeemed of that merchant

In [41]:
data['MerchantRedeemTime'] = [lastCount(x,y, True) for (x,y) in 
                                   zip(data['MerchantRedeemList'],data['DayNum'])]

### Merchant and Total purchases of that merchant

In [42]:
data['MerchantBuysTime'] = [lastCount(x,y, True) for (x,y) in 
                                   zip(data['MerchantBuyList'],data['DayNum'])]

### Merchant Redeem/Buy ratio

In [43]:
data['MerchantRedeemedBuyRatioTime'] = [x/y if y!=0 else 0 for x,y 
                                        in zip(data['MerchantRedeemTime'],data['MerchantBuysTime'])]

### Merchant Redeem/Release ratio

In [44]:
data['MerchantRatioTime'] = [x/y if y!=0 else 0 for x,y 
                                        in zip(data['MerchantRedeemTime']
                                               ,data['MerchantReleasesTime'])]

### Merchant and its last visiting window

In [45]:
data['LastMerchantVisit'] = [lastVisit(x,y, True) for (x,y) in 
                                      zip(data['MerchantBuyList'],data['DayNum'])]

In [46]:
data['LastMerchantRedemption'] = [lastVisit(x,y, True) for (x,y) in 
                                           zip(data['MerchantRedeemList'],data['DayNum'])]

### Merchant-User and its last buying window

In [47]:
data['LastMerchantUserVisit'] = [lastVisit(x,y, False) for (x,y) in 
                                          zip(data['Merchant_User_Visit'],data['DayNum'])]

## Time based Coupon level features

### Number of coupons released

In [48]:
data['CouponReleasesTime'] = [lastCount(x,y, True) for (x,y) in 
                                   zip(data['CouponReleaseList'],data['DayNum'])]

### Number of coupons redeemed

In [49]:
data['CouponRedeemTime'] = [lastCount(x,y, True) for (x,y) in 
                                   zip(data['CouponRedeemList'],data['DayNum'])]

### Coupon Redeemed/Released Ratio

In [50]:
data['CouponRatioTime'] = [x/y if y!=0 else 0 for x,y 
                                        in zip(data['CouponRedeemTime']
                                               ,data['CouponReleasesTime'])]

### Coupon Duration Time

In [51]:
data['CouponDurationTime'] = data['DayNum']-data['FirstReleaseDate']

## Time based Merchant-User features

### Number of coupons released

In [52]:
data['MerchantUserReleasesTime'] = [lastCount(x,y, False) for (x,y) in 
                                   zip(data['MerchantUserReleaseList'],data['DayNum'])]

### Number of coupons redeemed

In [53]:
data['MerchantUserRedeemTime'] = [lastCount(x,y, False) for (x,y) in 
                                   zip(data['MerchantUserRedeemList'],data['DayNum'])]

### Merchant-User Redeemed/Released Ratio

In [54]:
data['MerchantUserRatioTime'] = [x/y if y!=0 else 0 for x,y 
                                        in zip(data['MerchantUserRedeemTime']
                                               ,data['MerchantUserReleasesTime'])]

## Coupon-User and its last release window

In [55]:
data['LastCouponUserVisit']=[lastVisit(x,y, False) for (x,y) in 
                                          zip(data['Coupon_User_Visit'],data['DayNum'])]

# Adding variables to track the first time users, merchants and coupons

In [56]:
data['FirstTimeUser'] = [1 if x==x else 0 for x in data['LastUserVisit']]
data['FirstTimeMerchant'] =[1 if x==x else 0 for x in data['LastMerchantVisit']]
# data['FirstTimeCoupon'] = [1 if x==x else 0 for x in data['LastRedemption']]
data['FirstTimeMerchantUser'] = [1 if x==x else 0 for x in data['LastMerchantUserVisit']]
data['FirstTimeCouponUser'] =[1 if x==x else 0 for x in data['LastCouponUserVisit']]
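
The `x==x` test in the cell above is a NaN check: NaN is the only float value that is not equal to itself. Each flag is therefore 1 when the corresponding "last visit" gap exists (a previous interaction was found) and 0 when it is NaN. A small sketch on made-up values:

```python
import math

gaps = [14.0, float('nan'), 3.0]

# NaN is the only float that is not equal to itself, so x == x is a NaN test.
flags = [1 if x == x else 0 for x in gaps]
same = [0 if math.isnan(x) else 1 for x in gaps]
print(flags)  # [1, 0, 1]
```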

# Recency Variables

## User Level Recency Variables

### Recent Buys of User

In [57]:
data['UserBuysRecent'] = [lastCount(x,y, True)-lastCount(x,y-15, True) if y>15 else 0 for (x,y) in 
                                   zip(data['DayList'],data['DayNum'])]

### Recent Releases to User

In [58]:
data['UserReleasesRecent'] = [lastCount(x,y, True)-lastCount(x,y-15, True) if y>15 else 0 for (x,y) in 
                                   zip(data['UserReleaseList'],data['DayNum'])]

### Recent Redemption by User

In [59]:
data['UserRedeemRecent'] = [lastCount(x,y, True)-lastCount(x,y-15, True) if y>15 else 0 for (x,y) in 
                                   zip(data['UserRedeemList'],data['DayNum'])]

## Merchant Level Recency Variables

### Recent Buys from Merchant

In [60]:
data['MerchantBuysRecent'] = [lastCount(x,y, True)-lastCount(x,y-15, True) if y>15 else 0 for (x,y) in 
                                   zip(data['MerchantBuyList'],data['DayNum'])]

### Recent Releases from Merchant

In [61]:
data['MerchantReleasesRecent'] = [lastCount(x,y, True)-lastCount(x,y-15, True) if y>15 else 0 for (x,y) in 
                                   zip(data['MerchantReleaseList'],data['DayNum'])]

### Recent Redemption from Merchant

In [62]:
data['MerchantRedeemRecent'] = [lastCount(x,y, True)-lastCount(x,y-15, True) if y>15 else 0 for (x,y) in 
                                   zip(data['MerchantRedeemList'],data['DayNum'])]

## Coupon Level Recency Variables

### Recent Releases Coupon

In [63]:
data['CouponReleasesRecent'] = [lastCount(x,y, True)-lastCount(x,y-15, True) if y>15 else 0 for (x,y) in 
                                   zip(data['CouponReleaseList'],data['DayNum'])]

### Recent Redemption Coupon

In [64]:
data['CouponRedeemRecent'] = [lastCount(x,y, True)-lastCount(x,y-15, True) if y>15 else 0 for (x,y) in 
                                   zip(data['CouponRedeemList'],data['DayNum'])]

## Merchant User Level Recency Variables

### Recent Merchant-User Releases

In [65]:
data['MerchantUserReleasesRecent'] = [lastCount(x,y, False)-lastCount(x,y-15, False) if y>15 else 0 for (x,y) in 
                                   zip(data['MerchantUserReleaseList'],data['DayNum'])]

### Recent Merchant-User Redeem

In [66]:
data['MerchantUserRedeemRecent'] = [lastCount(x,y, False)-lastCount(x,y-15, False) if y>15 else 0 for (x,y) in 
                                   zip(data['MerchantUserRedeemList'],data['DayNum'])]
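
Each recency variable above is a difference of two cumulative counts, one taken at day `y` and one at day `y-15`. The sketch below shows that idea with a simplified counting rule, the number of entries on or before a given day. It is not a drop-in replacement for `lastCount`: results can differ when the day is not in the list or when days repeat, and the sketch has no special case for `y` of 15 or less. The function names are hypothetical.

```python
from bisect import bisect_right


def count_up_to(days, x):
    """Number of entries in sorted `days` that fall on or before day x."""
    return bisect_right(days, x)


def recent_count(days, x, window=15):
    """Entries in the half-open window (x - window, x]."""
    return count_up_to(days, x) - count_up_to(days, x - window)


buys = [11, 25, 81, 88, 105, 131, 154, 169]
print(recent_count(buys, 90))   # days 81 and 88 fall in (75, 90]
print(recent_count(buys, 169))  # only day 169 falls in (154, 169]
```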

## User-Merchant Relationship (New or Old)

In [67]:
#New Merchant Old User
data['NewMerchantOldUser'] = [1 if (x==1) & (y==0) else 0 for x,y in 
                         zip(data['FirstTimeMerchant'],data['FirstTimeUser'])]

In [68]:
#Old Merchant New User
data['OldMerchantNewUser'] = [1 if (x==0) & (y==1) else 0 for x,y in 
                         zip(data['FirstTimeMerchant'],data['FirstTimeUser'])]

### Recent New Merchants added to users

In [69]:
def recentNewMerchants(u_id, day):
    recent = day-15
    temp_data = data[(data['User_id']==u_id) & (data['DayNum']<day)]

    new_merchants = temp_data[temp_data['DayNum']>recent]['FirstTimeMerchantUser'].sum()
    overall_merchants = temp_data['Merchant_id'].nunique()

    return [new_merchants, overall_merchants]

In [70]:
# data['UsersRecentMerchants'] = [recentNewMerchants(x,y) for x, y in 
#                                         zip(data['User_id'], data['DayNum'])]


In [71]:
# data[['UserNewMerchantsRecent','UserTotalMerchantsRecent']] = pd.DataFrame(data.UsersRecentMerchants.values.tolist(), index= data.index)
# data.head()

# Separating Train and Test Data

In [72]:
train_dataset = data[data['DayNum']<=182] 
test_dataset  = data[data['DayNum']>182]

## Adding Target Label

In [73]:
#Unredeemed coupons get a far-future date so their duration exceeds 15 days
train_dataset['Date'] = train_dataset['Date'].fillna(pd.to_datetime('20161201',format='%Y%m%d'))
train_dataset['RedemptionDuration'] = train_dataset['Date'] - train_dataset['Date_received']
train_dataset['RedemptionDuration'] = train_dataset['RedemptionDuration'].dt.days.astype('str')
train_dataset['RedemptionDuration'] = pd.to_numeric(train_dataset['RedemptionDuration'],errors="coerce")
train_dataset['Target'] = [1 if x<=15 else 0 for x in train_dataset['RedemptionDuration']]
train_dataset.shape

(1015389, 93)
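
The labelling rule can be checked on single rows. The cell above fills a missing `Date` with 2016-12-01 so that unredeemed coupons end up with a duration far above 15 days; the sketch below handles the missing case directly instead. The name `redeemed_within` is hypothetical, and the dates are taken from the redeem sample shown earlier.

```python
from datetime import date


def redeemed_within(received, used, limit=15):
    """1 if the coupon was used within `limit` days of receipt, else 0 (unused coupons get 0)."""
    if used is None:
        return 0
    return 1 if (used - received).days <= limit else 0


print(redeemed_within(date(2016, 3, 29), date(2016, 4, 12)))  # used after 14 days
print(redeemed_within(date(2016, 1, 29), date(2016, 2, 28)))  # used after 30 days
print(redeemed_within(date(2016, 2, 14), None))               # never used
```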

In [74]:
print('Percentage of positive labels in training data: ')
print(str(round((train_dataset[train_dataset['Target']==1].shape[0]/train_dataset.shape[0])* 100, 2))+"%")
print(train_dataset[train_dataset['Target']==1].shape[0])

Percentage of positive labels in training data: 
6.12%
62153


In [75]:
test_dataset = test_dataset.drop('Date', axis=1)

### Saving Train and test datasets

In [76]:
train_dataset.to_csv('DataSets/DatasetsCreated/train_dataset.csv',index=False) 

In [77]:
test_dataset.to_csv('DataSets/DatasetsCreated/test_dataset.csv',index=False) 