## Description:
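This experiment again uses the Criteo dataset. Because the full dataset is too large to process on a single machine, we work with a small sample of it. The data is in the data/ folder: train.csv is the training set and test.csv is the test set. This notebook reads the data and preprocesses it in the following steps:
1. Read the dataset and fill missing values. To keep things simple, categorical features are filled with "-1" and numeric features with 0.
2. Encode the categorical features with LabelEncoder and normalize the numeric features with MinMaxScaler.
3. Split the training data into training and validation sets and save them to the preprocessed_data/ folder.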

## Importing packages and loading the data

In [34]:
# import packages
import numpy as np
import pandas as pd

from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split

In [35]:
# import data
train_df = pd.read_csv('./data/train.csv')
test_df = pd.read_csv('./data/test.csv')

print(train_df.shape, test_df.shape)

(1599, 41) (400, 40)


In [36]:
# Merge the training and test data
label = train_df['Label']
del train_df['Label']

data_df = pd.concat((train_df, test_df))
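
A small sketch of what `pd.concat` does to the row index, using made-up toy frames `a` and `b` (not the Criteo data). Each frame keeps its original index labels, so the merged frame has repeated labels and is most safely split back by position rather than by label:

```python
import pandas as pd

# Toy stand-ins for train_df and test_df (hypothetical data)
a = pd.DataFrame({"x": [1, 2, 3]})
b = pd.DataFrame({"x": [4, 5]})

merged = pd.concat((a, b))

# Index labels are kept from each input, so they repeat
merged_index = merged.index.tolist()

# Positional slicing recovers the two original parts
head = merged[:len(a)]
tail = merged[len(a):]
print(merged_index, head["x"].tolist(), tail["x"].tolist())
```

This is presumably why the notebook later separates the two sets by slicing on the training row count.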

In [37]:
del data_df['Id']

data_df.columns

Index(['I1', 'I2', 'I3', 'I4', 'I5', 'I6', 'I7', 'I8', 'I9', 'I10', 'I11',
       'I12', 'I13', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9',
       'C10', 'C11', 'C12', 'C13', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19',
       'C20', 'C21', 'C22', 'C23', 'C24', 'C25', 'C26'],
      dtype='object')

In [38]:
# Split the features by type: categorical (C*) and numeric (I*)
sparse_feas = [col for col in data_df.columns if col[0] == 'C']
dense_feas = [col for col in data_df.columns if col[0] == 'I']

In [39]:
# Fill missing values
data_df[sparse_feas] = data_df[sparse_feas].fillna('-1')
data_df[dense_feas] = data_df[dense_feas].fillna(0)
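
The same fill rule can be checked on a tiny made-up frame. This is a minimal sketch, with `toy`, `sparse_cols` and `dense_cols` as illustrative names that are not part of the notebook; it also counts the remaining missing values as a sanity check:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column
toy = pd.DataFrame({"I1": [1.0, np.nan, 3.0], "C1": ["a", None, "b"]})

sparse_cols = [c for c in toy.columns if c.startswith("C")]
dense_cols = [c for c in toy.columns if c.startswith("I")]

# Categorical gaps become the string "-1", numeric gaps become 0
toy[sparse_cols] = toy[sparse_cols].fillna("-1")
toy[dense_cols] = toy[dense_cols].fillna(0)

# Total number of missing cells left in the frame
remaining_missing = int(toy.isna().sum().sum())
print(toy, remaining_missing)
```

Filling with the string "-1" turns "missing" into its own category, which the label encoding step then treats like any other value.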

## Data preprocessing

In [40]:
# Encode the categorical features
for feat in sparse_feas:
    le = LabelEncoder()
    data_df[feat] = le.fit_transform(data_df[feat])
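
To illustrate what `LabelEncoder` does to one column, here is a minimal sketch on made-up values. The encoder sorts the distinct values and maps each to its position in that sorted order, so the resulting integers are identifiers, not quantities:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for one categorical column, including the "-1" filler
values = ["b", "-1", "a", "b"]

le = LabelEncoder()
codes = le.fit_transform(values)

# classes_ holds the sorted distinct values; each code indexes into it
classes = list(le.classes_)
decoded = le.inverse_transform(codes).tolist()
print(classes, codes.tolist(), decoded)
```

An encoder fitted on one set of values raises an error when asked to transform a value it has not seen, which is one likely reason the notebook fits on the merged training and test data.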

In [41]:
# Normalize the numeric features
mms = MinMaxScaler()
data_df[dense_feas] = mms.fit_transform(data_df[dense_feas])
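
`MinMaxScaler` rescales each column as x' = (x - min) / (max - min), with min and max taken from the data it was fitted on. The notebook fits on the merged training and test rows. The sketch below shows a common alternative, offered only as an option: fit on the training rows alone and reuse the fitted scaler on the test rows. The arrays are made-up toy values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
train_x = np.array([[0.0], [5.0], [10.0]])
test_x = np.array([[20.0]])

# Learn min and max from the training rows only
scaler = MinMaxScaler().fit(train_x)

train_scaled = scaler.transform(train_x)
# Test values outside the training range fall outside [0, 1]
test_scaled = scaler.transform(test_x)
print(train_scaled.ravel(), test_scaled.ravel())
```

The trade-off is that test values can land outside [0, 1] when fitting on training rows only, whereas fitting on the merged data keeps everything in range but lets test statistics influence the transform.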

In [42]:
# Split back into training and test sets
# .copy() makes train an independent frame, so adding a column does not trigger SettingWithCopyWarning
train = data_df[:train_df.shape[0]].copy()
test = data_df[train_df.shape[0]:]

train['Label'] = label


## Splitting the dataset

In [43]:
train_set, val_set = train_test_split(train, test_size = 0.1, random_state=2020)

In [44]:
train_set.head()

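| | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 | I10 | ... | C18 | C19 | C20 | C21 | C22 | C23 | C24 | C25 | C26 | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1308 | 0.0 | 0.000381 | 0.000473 | 0.0 | 0.009449 | 0.082147 | 0.004825 | 0.003656 | 0.040447 | 0.0 | ... | 166 | 116 | 2 | 14 | 0 | 0 | 60 | 27 | 327 | 0 |
| 935 | 0.0 | 0.000127 | 0.0 | 0.0 | 0.075768 | 0.0 | 0.0 | 0.0 | 0.00071 | 0.0 | ... | 145 | 0 | 0 | 1166 | 0 | 1 | 469 | 0 | 0 | 0 |
| 1596 | 0.0 | 0.000381 | 0.000236 | 0.137931 | 0.004804 | 0.030185 | 0.007841 | 0.062157 | 0.024126 | 0.0 | ... | 99 | 0 | 0 | 588 | 0 | 11 | 434 | 0 | 0 | 0 |
| 1143 | 0.0 | 0.011696 | 0.000473 | 0.034483 | 0.00218 | 0.0 | 0.0 | 0.032907 | 0.003193 | 0.0 | ... | 188 | 34 | 1 | 862 | 0 | 0 | 67 | 27 | 380 | 1 |
| 1419 | 0.0 | 0.00445 | 0.001064 | 0.034483 | 0.006119 | 0.039457 | 0.001206 | 0.009141 | 0.000887 | 0.0 | ... | 348 | 0 | 0 | 296 | 0 | 9 | 23 | 0 | 0 | 0 |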


In [45]:
train_set['Label'].value_counts()

0    1129
1     310
Name: Label, dtype: int64

In [46]:
val_set['Label'].value_counts()

0    136
1     24
Name: Label, dtype: int64
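
The two counts above correspond to positive rates of about 21.5% (310/1439) in the training split and 15.0% (24/160) in the validation split. If matching class ratios matter, `train_test_split` accepts a `stratify` argument. The following is a minimal sketch on made-up data (`toy` is not the Criteo sample), shown as an option rather than as what the notebook does:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 20% positive labels
toy = pd.DataFrame({"x": range(100), "Label": [1] * 20 + [0] * 80})

# stratify keeps the label ratio the same in both splits
tr, va = train_test_split(
    toy, test_size=0.1, random_state=2020, stratify=toy["Label"]
)

tr_rate = tr["Label"].mean()
va_rate = va["Label"].mean()
print(len(tr), len(va), tr_rate, va_rate)
```

With a validation set of only 160 rows, a few positives more or fewer shift the rate noticeably, so stratifying can make validation metrics easier to compare across runs.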

In [47]:
# Summarize the feature columns here
# def denseFeature(feat):  # feat is a numeric feature

In [48]:
# Save the files
train_set.reset_index(drop=True, inplace=True)
val_set.reset_index(drop=True, inplace=True)

train_set

| | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 | I10 | ... | C18 | C19 | C20 | C21 | C22 | C23 | C24 | C25 | C26 | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.000381 | 0.000473 | 0.000000 | 0.009449 | 0.082147 | 0.004825 | 0.003656 | 0.040447 | 0.0 | ... | 166 | 116 | 2 | 14 | 0 | 0 | 60 | 27 | 327 | 0 |
| 1 | 0.0 | 0.000127 | 0.000000 | 0.000000 | 0.075768 | 0.000000 | 0.000000 | 0.000000 | 0.000710 | 0.0 | ... | 145 | 0 | 0 | 1166 | 0 | 1 | 469 | 0 | 0 | 0 |
| 2 | 0.0 | 0.000381 | 0.000236 | 0.137931 | 0.004804 | 0.030185 | 0.007841 | 0.062157 | 0.024126 | 0.0 | ... | 99 | 0 | 0 | 588 | 0 | 11 | 434 | 0 | 0 | 0 |
| 3 | 0.0 | 0.011696 | 0.000473 | 0.034483 | 0.002180 | 0.000000 | 0.000000 | 0.032907 | 0.003193 | 0.0 | ... | 188 | 34 | 1 | 862 | 0 | 0 | 67 | 27 | 380 | 1 |
| 4 | 0.0 | 0.004450 | 0.001064 | 0.034483 | 0.006119 | 0.039457 | 0.001206 | 0.009141 | 0.000887 | 0.0 | ... | 348 | 0 | 0 | 296 | 0 | 9 | 23 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1434 | 0.0 | 0.028858 | 0.000591 | 0.000000 | 0.001422 | 0.006900 | 0.004222 | 0.005484 | 0.002129 | 0.0 | ... | 43 | 58 | 2 | 969 | 0 | 1 | 175 | 27 | 150 | 0 |
| 1435 | 0.0 | 0.000509 | 0.000828 | 0.298851 | 0.005456 | 0.000000 | 0.000000 | 0.021938 | 0.001064 | 0.0 | ... | 42 | 23 | 3 | 871 | 3 | 0 | 115 | 2 | 138 | 0 |
| 1436 | 0.0 | 0.000381 | 0.000000 | 0.000000 | 0.001604 | 0.027382 | 0.004222 | 0.000000 | 0.012063 | 0.0 | ... | 197 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1437 | 0.0 | 0.000127 | 0.000000 | 0.000000 | 0.039133 | 0.000000 | 0.000000 | 0.036563 | 0.000000 | 0.0 | ... | 88 | 0 | 0 | 576 | 3 | 0 | 547 | 0 | 0 | 0 |


In [49]:
# index=False omits the row index from the saved files
train_set.to_csv('preprocessed_data/train_set.csv', index=False)
val_set.to_csv('preprocessed_data/val_set.csv', index=False)
test.to_csv('preprocessed_data/test.csv', index=False)
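
A quick round-trip check can confirm that the saved files contain only the feature and label columns. This is a minimal sketch using a temporary directory and a made-up frame, so it does not touch the real preprocessed_data/ folder; the file name is only illustrative:

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with a non-default index, as left behind by a shuffle
df = pd.DataFrame({"I1": [0.1, 0.2], "Label": [0, 1]}, index=[7, 3])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_set.csv")
    # Writing without the index keeps only the real columns in the file
    df.to_csv(path, index=False)
    restored = pd.read_csv(path)

cols = list(restored.columns)
print(cols, restored.shape)
```

Had the index been written, reading the file back would produce an extra unnamed first column that downstream code would need to drop.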