# Repeat Buyer Prediction
#### Haopeng Huang, Ashutosh Jha, Muci Yu

## Introduction

Merchants sometimes run big promotions (e.g., discounts or cash coupons) on particular dates (e.g., Boxing-day Sales, "Black Friday" or "Double 11 (Nov 11th)" , in order to attract a large number of new buyers. Unfortunately, many of the attracted buyers are one-time deal hunters, and these promotions may have little long lasting impact on sales. To alleviate this problem, it is important for merchants to identify who can be converted into repeated buyers. By targeting on these potential loyal customers, merchants can greatly reduce the promotion cost and enhance the return on investment (ROI). It is well known that in the field of online advertising, customer targeting is extremely challenging, especially for fresh buyers. However, with the long-term user behavior log accumulated by Tmall.com, we may be able to solve this problem. In this challenge, we provide a set of merchants and their corresponding new buyers acquired during the promotion on the "Double 11" day. Your task is to predict which new buyers for given merchants will become loyal customers in the future. In other words, you need to predict the probability that these new buyers would purchase items from the same merchants again within 6 months. A data set containing around 200k users is given for training, while the other of similar size for testing. Similar to other competitions, you may extract any features, then perform training with additional tools. You need to only submit the prediction results for evaluation. 

[Link to the competition](https://tianchi.aliyun.com/competition/entrance/231576/introduction)

## High Level Workflow

### Data Cleaning/Preprocessing
    - Missing value in user_info.age_range
    - Group user_id and mercent_id into one single attribute
    - Create dummy variables for categorical data
    - Feature Selection Through Tree Classifiers
    - Feature Engineering (for every [user, merchant] combination):
        1. Average number of {click, save, purchase} activity 
        2. for each {brand, category}
        3. weighted by his with other merchants

## Exploratory Analysis

In [1]:
#Load Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Read data
data_format1 = 'Data/data_format1/'
data_format2 = 'Data/data_format2/'

In [3]:
!ls Data/data_format1

test_format1.csv      user_info_format1.csv
train_format1.csv     user_log_format1.csv


In [4]:
# Import Data
user_info1 = pd.read_csv(data_format1+'user_info_format1.csv')
user_log1 = pd.read_csv(data_format1 + "user_log_format1.csv")
train_format1 = pd.read_csv(data_format1 + "train_format1.csv")
test_format1 = pd.read_csv(data_format1 + "test_format1.csv")

## Data Cleaning

**How to deal with missing value?**

In [9]:
data_missing = (user_info1.age_range.isnull() | user_info1.gender.isnull())
data_total = user_info1.size
print('There are in total {} rows (out of {}) that contain missing value.'.format(data_missing.sum(),data_total))
print('Data lost if we drop all the users with missing data :')
print('{:.3f} of the total user info data we have.'.format(data_missing.sum()/data_total))

There are in total 6462 rows (out of 1272510) that contain missing value.
Data lost if we drop all the users with missing data :
0.005 of the total user info data we have.


This instances with missing values are only a small part of our data, so let's drop them for now. May come back later to discover some other ways to deal with missing value.

In [11]:
# user_info_ : user_info_ df after dropping missing values
user_info_ = user_info1[-data_missing]
print(user_info_.isnull().sum())
print("Remaining dataframe size: ", user_info_.size)
print("Remaining number of unique users: ", user_info_.user_id.size)
user_info_.head()

user_id      0
age_range    0
gender       0
dtype: int64
Remaining dataframe size:  1253124
Remaining number of unique users:  417708


Unnamed: 0,user_id,age_range,gender
0,376517,6.0,1.0
1,234512,5.0,0.0
2,344532,5.0,0.0
3,186135,5.0,0.0
4,30230,5.0,0.0


**Dataframes to create:**

    1.df_train: user_id, merchant_id, user_merchant, etc
            - Map from train_format1 with user_merchant
    2.df_test: user_id, merchant_id, user_merchant, etc
            - Map from test_format1 with user_merchant

In [23]:
train_format1['user_merchant'] = train_format1.user_id.astype(str)+'_'+train_format1.merchant_id.astype(str)
test_format1['user_merchant'] = test_format1.user_id.astype(str)+'_'+test_format1.merchant_id.astype(str)
user_log1['user_merchant'] = user_log1.user_id.astype(str)+'_'+user_log1.seller_id.astype(str)

In [75]:
def check_element(element, array):
    for x in array:
        if x == element:
            return True
    return False

In [76]:
train_user_log_mask =[check_element(x, train_format1) for x in user_log1.user_merchant]
user_log_train = user_log1[train_user_log_mask]
user_log_train.head()

Unnamed: 0,user_id,item_id,cat_id,seller_id,brand_id,time_stamp,action_type,user_merchant


In [141]:
user_log1.user_merchant.size*0.0001

5492.533

In [140]:
import time

start_time = time.clock()

def check_element(array1, array2):
    '''
    Check for each element in array1 if it exists in array2
    Return a True/False list with the length of array1
    '''
    prev = None
    result = []
    for x in array1:
        not_found = True
        if x == prev:
            result.append(result[-1])
            prev = x
        else:
            for i in range(len(array2)):
                if array2.iloc[i] == x:
                    result.append(True)
                    array2.pop(i)
                    not_found = False
                    break 
            if not_found:
                result.append(False)
    return result

train_user_log_mask = check_element(user_log1.user_merchant[:5492], train_format1.user_merchant[:5000])
print(time.clock() - start_time, "seconds")

230.42658600000004 seconds


In [134]:
train_format1.user_merchant.iloc[222878]

'365840_3979'

In [None]:
train_format1.user_merchant.[222878]

In [146]:
train_user_log_mask

[False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,


In [145]:
np.array(train_user_log_mask).sum()

0

**Who are the best selling merchants?**

In [65]:
print('66012_1752' in train_format1.user_merchant)
print(66012 in test_format1.user_id)
print(1752 in test_format1.merchant_id)

False
True
True


In [63]:
train_format1[train_format1.user_id == 66012]

Unnamed: 0,user_id,merchant_id,label,user_merchant
31305,66012,1752,0,66012_1752


In [66]:
test_format1[test_format1.user_id == 66012]

Unnamed: 0,user_id,merchant_id,prob,user_merchant


In [81]:
import time

start_time = time.clock()

'66012_1752' in test_format1.user_merchant

print(time.clock() - start_time, "seconds")

0.00034099999993486563 seconds


In [143]:
for x in test_format1.user_merchant:
    if x == '66012_1752':
        print(yes)

In [144]:
'66012_1752' in test_format1.user_merchant

False

In [83]:
array = [i for i in range(12)]
array

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

In [84]:
array.pop(3)

3

In [85]:
array

[0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11]

In [86]:
array.append(True)

In [87]:
array

[0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, True]

In [121]:
array = [i for i in range(12)]
results = []
array

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

In [122]:
for i in range(len(array)):
    print("Now is ", i)
    if array[i] == 8:
        results.append(True)
        array.pop(i)
        print('Found it')
        break


Now is  0
Now is  1
Now is  2
Now is  3
Now is  4
Now is  5
Now is  6
Now is  7
Now is  8
Found it
