# Repeat Buyer Prediction
#### Haopeng Huang, Ashutosh Jha, Muci Yu

## Introduction

Merchants sometimes run big promotions (e.g., discounts or cash coupons) on particular dates (e.g., Boxing Day sales, "Black Friday", or "Double 11" on Nov 11th) in order to attract a large number of new buyers. Unfortunately, many of the attracted buyers are one-time deal hunters, and these promotions may have little long-lasting impact on sales. To alleviate this problem, it is important for merchants to identify which buyers can be converted into repeat buyers. By targeting these potentially loyal customers, merchants can greatly reduce promotion costs and improve the return on investment (ROI). It is well known in the field of online advertising that customer targeting is extremely challenging, especially for new buyers. However, with the long-term user behavior log accumulated by Tmall.com, we may be able to solve this problem.

In this challenge, we provide a set of merchants and their corresponding new buyers acquired during the promotion on the "Double 11" day. Your task is to predict which new buyers for given merchants will become loyal customers in the future. In other words, you need to predict the probability that these new buyers will purchase items from the same merchants again within 6 months. A training set of around 200k users is provided, along with a test set of similar size. As in other competitions, you may extract any features and train with any additional tools. You only need to submit the prediction results for evaluation.

[Link to the competition](https://tianchi.aliyun.com/competition/entrance/231576/introduction)

## Exploratory Analysis

In [3]:
#Load Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
# Directories holding the two data formats
data_format1 = 'Data/data_format1/'
data_format2 = 'Data/data_format2/'

In [13]:
!ls Data/data_format1

test_format1.csv      user_info_format1.csv
train_format1.csv     user_log_format1.csv


In [11]:
user_info = pd.read_csv(data_format1+'user_info_format1.csv')
user_info.head()

| | user_id | age_range | gender |
|---|---|---|---|
| 0 | 376517 | 6.0 | 1.0 |
| 1 | 234512 | 5.0 | 0.0 |
| 2 | 344532 | 5.0 | 0.0 |
| 3 | 186135 | 5.0 | 0.0 |
| 4 | 30230 | 5.0 | 0.0 |


In [12]:
user_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 424170 entries, 0 to 424169
Data columns (total 3 columns):
user_id      424170 non-null int64
age_range    421953 non-null float64
gender       417734 non-null float64
dtypes: float64(2), int64(1)
memory usage: 9.7 MB
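
The `info()` output above reports fewer non-null entries for `age_range` (421,953) and `gender` (417,734) than for `user_id` (424,170), so both columns contain missing values. The following is a minimal sketch of how those gaps could be counted per column. The helper name `missing_summary` is hypothetical, and the toy frame only stands in for `user_info`; its values are illustrative, not real rows.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    n_missing = df.isna().sum()
    return pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(df)).round(2),
    })


# Toy stand-in for user_info (illustrative values only, not real rows).
toy_user_info = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age_range": [6.0, np.nan, 5.0, 5.0],
    "gender": [1.0, 0.0, np.nan, np.nan],
})
summary = missing_summary(toy_user_info)
print(summary)
```

On the real data the same call would be `missing_summary(user_info)`.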


In [6]:
user_log1 = pd.read_csv(data_format1 + "user_log_format1.csv")
user_log1.head()

| | user_id | item_id | cat_id | seller_id | brand_id | time_stamp | action_type |
|---|---|---|---|---|---|---|---|
| 0 | 328862 | 323294 | 833 | 2882 | 2661.0 | 829 | 0 |
| 1 | 328862 | 844400 | 1271 | 2882 | 2661.0 | 829 | 0 |
| 2 | 328862 | 575153 | 1271 | 2882 | 2661.0 | 829 | 0 |
| 3 | 328862 | 996875 | 1271 | 2882 | 2661.0 | 829 | 0 |
| 4 | 328862 | 1086186 | 1271 | 1253 | 1049.0 | 829 | 0 |


In [12]:
user_log1.describe()

| | user_id | item_id | cat_id | seller_id | brand_id | time_stamp | action_type |
|---|---|---|---|---|---|---|---|
| count | 54925330.0 | 54925330.0 | 54925330.0 | 54925330.0 | 54834320.0 | 54925330.0 | 54925330.0 |
| mean | 212156.8 | 553861.3 | 877.0308 | 2470.941 | 4153.348 | 923.0953 | 0.2854458 |
| std | 122287.2 | 322145.9 | 448.6269 | 1473.31 | 2397.679 | 195.4305 | 0.8075806 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 511.0 | 0.0 |
| 25% | 106336.0 | 273168.0 | 555.0 | 1151.0 | 2027.0 | 730.0 | 0.0 |
| 50% | 212654.0 | 555529.0 | 821.0 | 2459.0 | 4065.0 | 1010.0 | 0.0 |
| 75% | 317750.0 | 830689.0 | 1252.0 | 3760.0 | 6196.0 | 1109.0 | 0.0 |
| max | 424170.0 | 1113166.0 | 1671.0 | 4995.0 | 8477.0 | 1112.0 | 3.0 |
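
In the summary above, `time_stamp` ranges from 511 to 1112, which is consistent with a month-day (`MMDD`) encoding, i.e. May 11 to November 12. That reading is an assumption; the notebook does not state it. Under that assumption, a sketch of splitting the stamp into separate month and day columns could look like this (`split_mmdd` is a hypothetical helper name):

```python
import pandas as pd


def split_mmdd(time_stamp: pd.Series) -> pd.DataFrame:
    """Split an integer stamp into month and day, assuming MMDD encoding."""
    return pd.DataFrame({
        "month": time_stamp // 100,  # leading digits
        "day": time_stamp % 100,     # last two digits
    })


# Values taken from the outputs above: the first rows (829) and the
# min/max of the time_stamp column (511, 1112).
stamps = pd.Series([829, 511, 1112], name="time_stamp")
decoded = split_mmdd(stamps)
print(decoded)
```

On the full log this would be `split_mmdd(user_log1["time_stamp"])`; integer arithmetic avoids building 54.9 million strings.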


In [14]:
!ls Data/data_format2

test_format2.csv  train_format2.csv


In [15]:
train_format2 = pd.read_csv(data_format2 + "train_format2.csv")
train_format2.head()

| | user_id | age_range | gender | merchant_id | label | activity_log |
|---|---|---|---|---|---|---|
| 0 | 34176 | 6.0 | 0.0 | 944 | -1 | 408895:1505:7370:1107:0 |
| 1 | 34176 | 6.0 | 0.0 | 412 | -1 | 17235:1604:4396:0818:0#954723:1604:4396:0818:0... |
| 2 | 34176 | 6.0 | 0.0 | 1945 | -1 | 231901:662:2758:0818:0#231901:662:2758:0818:0#... |
| 3 | 34176 | 6.0 | 0.0 | 4752 | -1 | 174142:821:6938:1027:0 |
| 4 | 34176 | 6.0 | 0.0 | 643 | -1 | 716371:1505:968:1024:3 |
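
Each `activity_log` value appears to pack several actions separated by `#`, with five `:`-separated fields per action. Assuming the field order is `item_id:cat_id:brand_id:time_stamp:action_type` (it matches the column ranges of `user_log1`, but the notebook does not confirm it), a rough sketch of expanding one log string into rows might be as follows. `FIELDS` and `parse_activity_log` are hypothetical names.

```python
import pandas as pd

# Assumed field order; not confirmed by the notebook.
FIELDS = ["item_id", "cat_id", "brand_id", "time_stamp", "action_type"]


def parse_activity_log(log: str) -> pd.DataFrame:
    """Expand one '#'-separated activity_log string into one row per action."""
    records = [entry.split(":") for entry in log.split("#") if entry]
    # Values are kept as strings so stamps such as "0818" keep their leading zero.
    return pd.DataFrame(records, columns=FIELDS)


# The first two actions visible in row 1 of the output above.
example = "17235:1604:4396:0818:0#954723:1604:4396:0818:0"
parsed = parse_activity_log(example)
print(parsed)
```

Keeping the fields as strings is a deliberate choice here: casting to integers can follow once missing values (for example, an empty `brand_id`) have been handled.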
