## pre_process.ipynb

- Data preprocessing code for Task 1 and Task 2

## Description of the generated data

1. ccf_off_test.csv

- Generated from dataset_raw/ccf_offline_stage1_test_revised.csv

- New columns:

| no_distance | is_full_discount | discount_x | discount_y | discount_rate | discount_type |
| ----------- | ---------------- | ---------- | ---------- | ------------- | ------------- |
| Whether distance information is missing | Whether the coupon is a threshold coupon ("spend x, get y off") | Spend threshold x (RMB) | Amount off y (RMB) | Equivalent discount rate of the coupon | Integer code of the coupon type (17 types) |
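
To illustrate how the discount columns relate to each other, here is a minimal sketch that parses a single raw `Discount_rate` string. The helper name `parse_discount` is hypothetical and not part of the notebook, which derives these columns with vectorized pandas operations; the sketch only mirrors the rules described above, using -1 for `discount_x` and `discount_y` when the coupon has no threshold.

```python
def parse_discount(raw: str) -> dict:
    """Derive the discount columns from one raw Discount_rate string.

    Hypothetical helper for illustration only.
    """
    if ':' in raw:
        # Threshold coupon "x:y": spend x RMB, get y RMB off.
        x_str, y_str = raw.split(':')
        x, y = int(x_str), int(y_str)
        return {'is_full_discount': 1, 'discount_x': x, 'discount_y': y,
                'discount_rate': 1 - y / x}
    # Direct rate such as "0.95": no threshold, so x and y are set to -1.
    return {'is_full_discount': 0, 'discount_x': -1, 'discount_y': -1,
            'discount_rate': float(raw)}


print(parse_discount('30:5'))
print(parse_discount('0.95'))
```

For `30:5` the equivalent rate is 1 - 5/30 ≈ 0.8333, which matches the `discount_rate` shown for such coupons in the notebook outputs.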

2. ccf_off_train.csv

- Generated from dataset_raw/ccf_offline_stage1_train.csv

- New columns on top of `1`:

| normal_consume | coupon_consume | no_consume |
| -------------- | -------------- | ---------- |
| Whether this is a purchase without a coupon (normal purchase) | Whether this is a purchase with a coupon (no 15-day limit) | Whether a coupon was received but not used |

---

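> The three flags behave like a one-hot encoding: exactly one of them is 1 and the other two are 0.
> (Records with neither a coupon nor a purchase do not need to be logged.)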
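
The rule for the three offline flags can be sketched for a single record as follows. The function `offline_flags` and the sample records are hypothetical; `None` stands in for a missing `Coupon_id` or `Date`.

```python
from typing import Optional


def offline_flags(coupon_id: Optional[int], date: Optional[str]) -> dict:
    """Return the three offline behavior flags for one record.

    Hypothetical helper; None represents a missing value.
    """
    if coupon_id is None and date is None:
        # No coupon and no purchase: such an event is never logged.
        raise ValueError('no coupon and no purchase: not a recorded event')
    return {
        'normal_consume': int(coupon_id is None and date is not None),
        'coupon_consume': int(coupon_id is not None and date is not None),
        'no_consume': int(coupon_id is not None and date is None),
    }


# Made-up records: (Coupon_id, Date)
for record in [(None, '2016-02-17'), (3739, '2016-06-02'), (11002, None)]:
    flags = offline_flags(*record)
    assert sum(flags.values()) == 1  # exactly one flag is set
    print(record, flags)
```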

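3. ccf_on_train.csv

- Generated from dataset_raw/ccf_online_stage1_train.csv

- Compared with `2`, the new features differ as follows:

  - fixed_consume: whether this is a limited-time markdown purchase (every limited-time markdown record is a purchase)
  - is_click: whether this is a click action
  - normal_consume, coupon_consume, no_consume, fixed_consume, and is_click sum to 1
    - Click actions have no coupon
    - A record without a coupon is not necessarily a click; it can also be a normal purchase
  - discount_rate is -1.0 for limited-time markdown purchases
    - discount_rate is 1.0 for both clicks and normal purchases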
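
As a sketch of how the five online flags partition the records, the function below applies the same conditions to one record at a time. `online_flags` is a hypothetical helper and the sample records are made up; they assume consistent data (for example, that click records carry no coupon and no purchase date).

```python
from typing import Optional


def online_flags(action: int, coupon_id: Optional[str], date: Optional[str]) -> dict:
    """Return the five online behavior flags for one record.

    Hypothetical helper; None represents a missing value.
    """
    bought = date is not None
    return {
        'is_click': int(action == 0),
        'normal_consume': int(action == 1 and coupon_id is None),
        'coupon_consume': int(bought and coupon_id is not None and coupon_id != 'fixed'),
        'fixed_consume': int(bought and coupon_id == 'fixed'),
        'no_consume': int(action == 2),
    }


# Made-up records: (Action, Coupon_id, Date)
samples = [
    (0, None, None),             # click
    (1, None, '2016-03-22'),     # normal purchase
    (1, '1234', '2016-06-02'),   # purchase with a coupon
    (1, 'fixed', '2016-05-08'),  # limited-time markdown purchase
    (2, '1234', None),           # coupon received but not used
]
for sample in samples:
    flags = online_flags(*sample)
    assert sum(flags.values()) == 1  # exactly one flag is set
    print(sample, flags)
```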

In [2]:
import pandas as pd
import numpy as np

no_date = pd.to_datetime(0)  # Unix epoch (1970-01-01), used as the sentinel for missing dates

In [2]:
def pre_process_off_new(df: pd.DataFrame):
    """Preprocess an offline dataset (the training set has a Date column, the test set does not)."""
    if 'Date' in df.columns:  # training set only
        df['normal_consume'] = 0  # purchase without a coupon (normal purchase)
        df.loc[df['Coupon_id'].isna() & df['Date'].notna(), 'normal_consume'] = 1
        df['coupon_consume'] = 0  # purchase with a coupon (no 15-day limit)
        df.loc[df['Coupon_id'].notna() & df['Date'].notna(), 'coupon_consume'] = 1
        df['no_consume'] = 0  # coupon received but not used
        df.loc[df['Coupon_id'].notna() & df['Date'].isna(), 'no_consume'] = 1
        # A Coupon_id column containing nulls is read as float; fill the nulls and cast back to int
        df['Coupon_id'] = df['Coupon_id'].fillna(0).astype(int)
        # No coupon is equivalent to a rate of 1.0 (full price); fill with a str so the .str calls below work
        df['Discount_rate'] = df['Discount_rate'].fillna('1.0')
        # Distance is filled below, for both datasets
        df['Date_received'] = df['Date_received'].fillna(no_date)
        df['Date'] = df['Date'].fillna(no_date)

    # The steps below apply to both datasets (the offline test set has no Date column)
    df['Distance'] = df['Distance'].fillna(-1).astype(int)
    df['no_distance'] = (df['Distance'] == -1).astype(int)
    df['is_full_discount'] = df['Discount_rate'].str.contains(':').astype(int)
    # expand=True is required to get a DataFrame back; cast to float because
    # rows without a threshold coupon become NaN when the result is aligned
    df[['discount_x', 'discount_y']] = df[df['is_full_discount'] == 1]['Discount_rate']\
        .str.split(':', expand=True).astype(float)
    df['discount_rate'] = (1 - (df['discount_y'] / df['discount_x']))\
        .fillna(df['Discount_rate']).astype(float)
    df[['discount_x', 'discount_y']] = \
        df[['discount_x', 'discount_y']].fillna(-1).astype(int)
    _rate = sorted(set(df.discount_rate))  # enumerate the distinct discount rates
    df['discount_type'] = df['discount_rate'].apply(lambda x: _rate.index(x))
    return df
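
The `discount_type` column is an ordinal code: the position of each row's `discount_rate` in the sorted list of distinct rates found in that DataFrame. As a sketch with made-up rates, the hypothetical helper `encode_rates` produces the same kind of codes with a dictionary lookup, which avoids a linear `list.index` search for every row.

```python
def encode_rates(rates: list) -> list:
    """Map each rate to its rank among the sorted distinct rates.

    Hypothetical helper for illustration only.
    """
    code_of = {rate: code for code, rate in enumerate(sorted(set(rates)))}
    return [code_of[rate] for rate in rates]


sample = [0.9, 1.0, 0.8, 0.9, 0.95]
print(encode_rates(sample))      # [1, 3, 0, 1, 2]

# The code of a rate depends on which rates are present:
print(encode_rates([0.9, 1.0]))  # [0, 1]
```

Because the codes are derived separately for each DataFrame, the same rate can receive different codes in different outputs. For example, 0.966667 appears with `discount_type` 14 in the test output and 15 in the training output below. If consistent codes matter downstream, one option would be to build the mapping once from the rates of both datasets and reuse it.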

In [3]:
test_df = pd.read_csv('./dataset_raw/ccf_offline_stage1_test_revised.csv', parse_dates=['Date_received'])
train_off_df = pd.read_csv('./dataset_raw/ccf_offline_stage1_train.csv', parse_dates=['Date_received', 'Date'])

In [4]:
out_test_df = pre_process_off_new(test_df)
out_train_off_df = pre_process_off_new(train_off_df)

In [5]:
out_test_df.notna().all().all(), out_train_off_df.notna().all().all()

(True, True)

In [6]:
out_test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113640 entries, 0 to 113639
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   User_id           113640 non-null  int64         
 1   Merchant_id       113640 non-null  int64         
 2   Coupon_id         113640 non-null  int64         
 3   Discount_rate     113640 non-null  object        
 4   Distance          113640 non-null  int64         
 5   Date_received     113640 non-null  datetime64[ns]
 6   no_distance       113640 non-null  int64         
 7   is_full_discount  113640 non-null  int64         
 8   discount_x        113640 non-null  int64         
 9   discount_y        113640 non-null  int64         
 10  discount_rate     113640 non-null  float64       
 11  discount_type     113640 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(9), object(1)
memory usage: 10.4+ MB


In [7]:
out_test_df

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,no_distance,is_full_discount,discount_x,discount_y,discount_rate,discount_type
0,4129537,450,9983,30:5,1,2016-07-12,0,1,30,5,0.833333,7
1,6949378,1300,3429,30:5,-1,2016-07-06,1,1,30,5,0.833333,7
2,2166529,7113,6928,200:20,5,2016-07-27,0,1,200,20,0.900000,10
3,2166529,7113,1808,100:10,5,2016-07-27,0,1,100,10,0.900000,10
4,6172162,7605,6500,30:1,2,2016-07-08,0,1,30,1,0.966667,14
...,...,...,...,...,...,...,...,...,...,...,...,...
113635,5828093,5717,10418,30:5,10,2016-07-16,0,1,30,5,0.833333,7
113636,6626813,1699,7595,30:1,-1,2016-07-07,1,1,30,1,0.966667,14
113637,6626813,7321,7590,50:5,-1,2016-07-12,1,1,50,5,0.900000,10
113638,4547069,760,13602,30:5,0,2016-07-17,0,1,30,5,0.833333,7


In [8]:
out_train_off_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1754884 entries, 0 to 1754883
Data columns (total 16 columns):
 #   Column            Dtype         
---  ------            -----         
 0   User_id           int64         
 1   Merchant_id       int64         
 2   Coupon_id         int64         
 3   Discount_rate     object        
 4   Distance          int64         
 5   Date_received     datetime64[ns]
 6   Date              datetime64[ns]
 7   normal_consume    int64         
 8   coupon_consume    int64         
 9   no_consume        int64         
 10  no_distance       int64         
 11  is_full_discount  int64         
 12  discount_x        int64         
 13  discount_y        int64         
 14  discount_rate     float64       
 15  discount_type     int64         
dtypes: datetime64[ns](2), float64(1), int64(12), object(1)
memory usage: 214.2+ MB


In [9]:
out_train_off_df

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,normal_consume,coupon_consume,no_consume,no_distance,is_full_discount,discount_x,discount_y,discount_rate,discount_type
0,1439408,2632,0,1.0,0,1970-01-01,2016-02-17,1,0,0,0,0,-1,-1,1.000000,19
1,1439408,4663,11002,150:20,1,2016-05-28,1970-01-01,0,0,1,0,1,150,20,0.866667,11
2,1439408,2632,8591,20:1,0,2016-02-17,1970-01-01,0,0,1,0,1,20,1,0.950000,14
3,1439408,2632,1078,20:1,0,2016-03-19,1970-01-01,0,0,1,0,1,20,1,0.950000,14
4,1439408,2632,8591,20:1,0,2016-06-13,1970-01-01,0,0,1,0,1,20,1,0.950000,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1754879,212662,3532,0,1.0,1,1970-01-01,2016-03-22,1,0,0,0,0,-1,-1,1.000000,19
1754880,212662,3021,3739,30:1,6,2016-05-08,2016-06-02,0,1,0,0,1,30,1,0.966667,15
1754881,212662,2934,0,1.0,2,1970-01-01,2016-03-21,1,0,0,0,0,-1,-1,1.000000,19
1754882,752472,7113,1633,50:10,6,2016-06-13,1970-01-01,0,0,1,0,1,50,10,0.800000,8


In [10]:
(out_train_off_df['normal_consume']
 + out_train_off_df['coupon_consume']
 + out_train_off_df['no_consume'] == 1).all()
# behaves like a one-hot encoding

True

In [11]:
out_test_df.to_csv('./dataset_cleaned/ccf_off_test.csv', index=False)
out_train_off_df.to_csv('./dataset_cleaned/ccf_off_train.csv', index=False)

- Online feature processing

In [3]:
def pre_process_online(df: pd.DataFrame):
    """Preprocess the online training dataset."""
    df['coupon_consume'] = 0  # purchase with a coupon
    df.loc[df['Date'].notna() & df['Coupon_id'].notna() & (df['Coupon_id'] != 'fixed'),
           'coupon_consume'] = 1
    df['fixed_consume'] = 0  # limited-time markdown purchase
    df.loc[df['Date'].notna() & (df['Coupon_id'] == 'fixed'), 'fixed_consume'] = 1
    # Clear Date_received for 'fixed' (limited-time markdown) records
    df.loc[df['Date'].notna() & (df['Coupon_id'] == 'fixed'), 'Date_received'] = no_date
    df['normal_consume'] = 0  # normal purchase, i.e. a purchase without a coupon
    df.loc[(df['Action'] == 1) & df['Coupon_id'].isna(), 'normal_consume'] = 1
    df['is_click'] = (df['Action'] == 0).astype(int)  # click action
    df['no_consume'] = 0  # coupon received but not used
    df.loc[(df['Action'] == 2), 'no_consume'] = 1
    df['Date'] = df['Date'].fillna(no_date)
    df['Date_received'] = df['Date_received'].fillna(no_date)
    df['Coupon_id'] = df['Coupon_id'].replace('fixed', 0)
    df['Coupon_id'] = df['Coupon_id'].fillna(0).astype(int)
    df['Discount_rate'] = df['Discount_rate'].replace('fixed', '-1.0')  # mark limited-time markdown as -1.0
    df['Discount_rate'] = df['Discount_rate'].fillna('1.0')
    df['is_full_discount'] = df['Discount_rate'].str.contains(':').astype(int)
    # expand=True is required to get a DataFrame back; cast to float because
    # rows without a threshold coupon become NaN when the result is aligned
    df[['discount_x', 'discount_y']] = df[df['is_full_discount'] == 1]['Discount_rate']\
        .str.split(':', expand=True).astype(float)
    df['discount_rate'] = (1 - (df['discount_y'] / df['discount_x']))\
        .fillna(df['Discount_rate']).astype(float)
    df[['discount_x', 'discount_y']] = \
        df[['discount_x', 'discount_y']].fillna(-1).astype(int)
    _rate = sorted(set(df.discount_rate))  # enumerate the distinct discount rates
    df['discount_type'] = df['discount_rate'].apply(lambda x: _rate.index(x))
    return df

In [4]:
on_data = pd.read_csv('./dataset_raw/ccf_online_stage1_train.csv', parse_dates=['Date', 'Date_received'])

In [5]:
out_on_data = pre_process_online(on_data)

In [7]:
out_on_data.Date_received.describe()

count                11429826
unique                    168
top       1970-01-01 00:00:00
freq                 10689015
first     1970-01-01 00:00:00
last      2016-06-15 00:00:00
Name: Date_received, dtype: object
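
The `describe()` output shows that the most frequent value is the sentinel 1970-01-01 (10689015 of 11429826 rows), so most online records have no receive date. When working with the cleaned data, the sentinel has to be excluded before computing anything on real dates. A minimal sketch, using a made-up Series in place of the actual `Date_received` column:

```python
import pandas as pd

no_date = pd.to_datetime(0)  # same sentinel as in the notebook: 1970-01-01

# Made-up sample standing in for the Date_received column.
received = pd.Series(pd.to_datetime(
    ['1970-01-01', '2016-06-15', '1970-01-01', '2016-05-08']))

has_date = received != no_date   # True where a real date is present
real_dates = received[has_date]  # drop the sentinel rows

print(int(has_date.sum()))       # 2
print(real_dates.min(), real_dates.max())
```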

In [8]:
out_on_data.notna().all()

User_id             True
Merchant_id         True
Action              True
Coupon_id           True
Discount_rate       True
Date_received       True
Date                True
coupon_consume      True
fixed_consume       True
normal_consume      True
no_consume          True
is_click            True
is_full_discount    True
discount_x          True
discount_y          True
discount_rate       True
discount_type       True
dtype: bool

In [9]:
(out_on_data['normal_consume']
 + out_on_data['coupon_consume']
 + out_on_data['no_consume']
 + out_on_data['fixed_consume']
 + out_on_data['is_click'] == 1).all()

True

In [10]:
out_on_data.to_csv('./dataset_cleaned/ccf_on_train.csv', index=False)