# Otto Product Classification: Feature Engineering on the Test Data

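The Otto dataset is a multi-class product classification problem provided by the well-known e-commerce company Otto, with 9 classes. Each sample has 93 numeric features (integers, possibly counts of some kind of event; the data has already been anonymized).
Competition page: https://www.kaggle.com/c/otto-group-product-classification-challenge/data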

This notebook applies feature engineering to the test data.

In [1]:
# First, import the required modules
import pandas as pd
import numpy as np

## Load the data

In [2]:
# Load the test data
# path to the directory that holds the data
dpath = './data/'
test = pd.read_csv(dpath + "Otto_test.csv")
test.head()

Unnamed: 0,id,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8,feat_9,...,feat_84,feat_85,feat_86,feat_87,feat_88,feat_89,feat_90,feat_91,feat_92,feat_93
0,1,0,0,0,0,0,0,0,0,0,...,0,0,11,1,20,0,0,0,0,0
1,2,2,2,14,16,0,0,0,0,0,...,0,0,0,0,0,4,0,0,2,0
2,3,0,1,12,1,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,0,...,0,3,1,0,0,0,0,0,0,0
4,5,1,0,0,1,0,0,1,2,0,...,0,0,0,0,0,0,0,9,0,0


In [3]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144368 entries, 0 to 144367
Data columns (total 94 columns):
id         144368 non-null int64
feat_1     144368 non-null int64
feat_2     144368 non-null int64
feat_3     144368 non-null int64
feat_4     144368 non-null int64
feat_5     144368 non-null int64
feat_6     144368 non-null int64
feat_7     144368 non-null int64
feat_8     144368 non-null int64
feat_9     144368 non-null int64
feat_10    144368 non-null int64
feat_11    144368 non-null int64
feat_12    144368 non-null int64
feat_13    144368 non-null int64
feat_14    144368 non-null int64
feat_15    144368 non-null int64
feat_16    144368 non-null int64
feat_17    144368 non-null int64
feat_18    144368 non-null int64
feat_19    144368 non-null int64
feat_20    144368 non-null int64
feat_21    144368 non-null int64
feat_22    144368 non-null int64
feat_23    144368 non-null int64
feat_24    144368 non-null int64
feat_25    144368 non-null int64
feat_26    144368 non-null int64

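## Feature engineering
Feature engineering on the test data is identical to that on the training data.
The feature-encoding models were fitted on the training set, so here we only need to load them.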
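
The fit-on-train, load-on-test workflow can be sketched as follows. This is a minimal illustration on toy arrays, assuming scikit-learn's `MinMaxScaler` and the standard `pickle` module; the file name `scaler_demo.pkl` and the toy data are hypothetical and are not the files used in this notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the training and test feature matrices (hypothetical data)
X_train_demo = np.array([[0, 2], [4, 0], [2, 10]], dtype=float)
X_test_demo = np.array([[1, 5], [8, 0]], dtype=float)

# Training side: fit the transformer on the training data only, then persist it
scaler = MinMaxScaler().fit(X_train_demo)
demo_path = os.path.join(tempfile.mkdtemp(), "scaler_demo.pkl")  # hypothetical file name
with open(demo_path, "wb") as f:
    pickle.dump(scaler, f)

# Test side: load the fitted transformer and call transform only (never fit)
with open(demo_path, "rb") as f:
    loaded_scaler = pickle.load(f)
X_test_scaled = loaded_scaler.transform(X_test_demo)
print(X_test_scaled)
```

Because the minimum and maximum come from the training data, a test value larger than the training maximum is mapped above 1 (the 8 in the first column here). That is expected and keeps the two sets on the same scale.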

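Feature transformation is mostly tedious manual work. Candidate transformations:
1. Take the logarithm (important for linear models, little effect on tree models)
2. tf-idf
3. Combinations of raw features (add, subtract, multiply, divide). For count features, multiplication acts as an "and" and is more meaningful (as in factorization machines, FM). Alternatively, GBDT can be used for feature encoding to obtain higher-order combinations. If the raw feature dimensionality is too high, first obtain feature importances from a base model and combine only the important features. (Covered in the CTR section.)
4. Features obtained after t-SNE and PCA dimensionality reduction (covered in the dimensionality reduction section)
5. Statistical features, such as sum of the row, number of non-zero, max of the row, x-mean. I personally feel these are of little value for this dataset.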
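
Item 3 can be illustrated with pairwise products. The sketch below is not part of this notebook's pipeline: the toy DataFrame and the list `important_cols` are hypothetical, and it assumes the important columns have already been chosen by some base model.

```python
from itertools import combinations

import pandas as pd

# Toy count features (hypothetical data, not the Otto columns)
df_demo = pd.DataFrame({"feat_1": [0, 2, 1], "feat_2": [3, 0, 1], "feat_3": [1, 1, 0]})

# Assumption: these columns were picked as "important" by some base model
important_cols = ["feat_1", "feat_2", "feat_3"]

# Pairwise products: non-zero only when both counts are non-zero (an "and")
products = {
    f"{a}_x_{b}": df_demo[a] * df_demo[b]
    for a, b in combinations(important_cols, 2)
}
df_combined = pd.concat([df_demo, pd.DataFrame(products)], axis=1)
print(df_combined)
```

With 93 raw features, all pairs would add 4278 columns, which is why restricting the combinations to a few important features is worth considering.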

## Separate the features from the id

In [4]:
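# keep the ids; they are needed when saving the transformed features and for the submission
test_id = test['id']
# drop the id column; the test set has no labels, so the remaining columns are the features
X_test = test.drop(["id"], axis=1)

# keep the feature names
columns_org = X_test.columns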

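## 1. Feature encoding: log(x+1)
The raw features feat_x look like count features. Taking the log is closer to how people perceive magnitudes and is better suited to linear models.
It also reduces the influence of large values in a long-tailed distribution, which weakens the long tail.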

In [5]:
X_test_log = np.log1p(X_test)

# rebuild the DataFrame
feat_names = columns_org + "_log"
X_test_log = pd.DataFrame(columns = feat_names, data = X_test_log.values)

X_test_log.head()

Unnamed: 0,feat_1_log,feat_2_log,feat_3_log,feat_4_log,feat_5_log,feat_6_log,feat_7_log,feat_8_log,feat_9_log,feat_10_log,...,feat_84_log,feat_85_log,feat_86_log,feat_87_log,feat_88_log,feat_89_log,feat_90_log,feat_91_log,feat_92_log,feat_93_log
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.386294,...,0.0,0.0,2.484907,0.693147,3.044522,0.0,0.0,0.0,0.0,0.0
1,1.098612,1.098612,2.70805,2.833213,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.609438,0.0,0.0,1.098612,0.0
2,0.0,0.693147,2.564949,0.693147,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.098612,0.0,0.0,0.0,0.0,0.693147
3,0.0,0.0,0.0,0.693147,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.386294,0.693147,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.693147,0.0,0.0,0.693147,0.0,0.0,0.693147,1.098612,0.0,1.386294,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.302585,0.0,0.0


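## 2. Feature encoding: TF-IDF
The raw features feat_x look like count features, similar to term-frequency features in text analysis. TF-IDF can emphasize low-frequency terms that contribute to particular classes.
Since the raw features are already counts, we call TfidfTransformer directly to turn the counts into TF-IDF.
If the input were raw text, the counting step (TF) and the IDF step would need to be combined, using TfidfVectorizer.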
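
To make the transformation concrete, here is a NumPy sketch of the computation that scikit-learn's `TfidfTransformer` performs with its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), as far as I understand them. The helper names `fit_idf` and `tfidf_transform` and the toy counts are hypothetical, and the pickled `tfidf` object used in this notebook may have been configured differently.

```python
import numpy as np

def fit_idf(counts_train):
    """Smoothed IDF per feature: ln((1 + n) / (1 + df)) + 1."""
    n_samples = counts_train.shape[0]
    doc_freq = (counts_train > 0).sum(axis=0)  # samples in which each feature is non-zero
    return np.log((1.0 + n_samples) / (1.0 + doc_freq)) + 1.0

def tfidf_transform(counts, idf):
    """Weight counts by IDF, then scale each row to unit L2 norm."""
    weighted = counts * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return weighted / norms

# Toy count matrices (hypothetical data)
counts_train = np.array([[3, 0, 1], [2, 0, 0], [1, 4, 0]], dtype=float)
counts_test = np.array([[1, 1, 0], [0, 0, 0]], dtype=float)

idf = fit_idf(counts_train)                 # learned from the training counts only
X_demo = tfidf_transform(counts_test, idf)  # applied unchanged to the test counts
print(idf)
print(X_demo)
```

In the first test row both features have a count of 1, but the second feature is rarer in the training data, so it receives the larger weight.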

In [6]:
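# transform counts to TF-IDF features
#from sklearn.feature_extraction.text import TfidfTransformer
#tfidf = TfidfTransformer()

# load the TfidfTransformer fitted on the training data
# (pickle works on both Python 2 and 3; cPickle exists only on Python 2)
import pickle
with open("tfidf.pkl", 'rb') as f:
    tfidf = pickle.load(f)

# transform returns a sparse matrix; convert it to a dense array
X_test_tfidf = tfidf.transform(X_test).toarray()

# rebuild the DataFrame for display
feat_names = columns_org + "_tfidf"
X_test_tfidf = pd.DataFrame(columns = feat_names, data = X_test_tfidf)

X_test_tfidf.head()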

Unnamed: 0,feat_1_tfidf,feat_2_tfidf,feat_3_tfidf,feat_4_tfidf,feat_5_tfidf,feat_6_tfidf,feat_7_tfidf,feat_8_tfidf,feat_9_tfidf,feat_10_tfidf,...,feat_84_tfidf,feat_85_tfidf,feat_86_tfidf,feat_87_tfidf,feat_88_tfidf,feat_89_tfidf,feat_90_tfidf,feat_91_tfidf,feat_92_tfidf,feat_93_tfidf
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.18324,...,0.0,0.0,0.411895,0.052224,0.842245,0.0,0.0,0.0,0.0,0.0
1,0.06795,0.078094,0.443016,0.493584,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.122674,0.0,0.0,0.061405,0.0
2,0.0,0.058829,0.572102,0.046477,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.078248,0.0,0.0,0.0,0.0,0.069951
3,0.0,0.0,0.0,0.044693,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.135951,0.033452,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.047877,0.0,0.0,0.043471,0.0,0.0,0.059028,0.079725,0.0,0.159226,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.548938,0.0,0.0


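## 3. Other feature engineering
Item 5 from the list above: the row-wise maximum, sum, and number of non-zero elements.
These features would be appended to the raw features (the cell below is left commented out).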

In [7]:
#X_test['feat_max'] = X_test.max(axis=1)
#X_test['feat_sum'] = X_test.sum(axis=1)
#X_test['feat_zero_count'] = X_test.apply(lambda x : x.value_counts().get(0,0),axis=1)
#X_test.head()
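
If these row statistics were enabled, one point to watch is that adding `feat_max` as a column first would cause a later row sum to include it, and that the commented-out lambda counts zeros rather than non-zeros. A hedged sketch on toy data (the DataFrame and the column name `feat_nonzero_count` are hypothetical) that computes every statistic from the original columns:

```python
import pandas as pd

# Toy count features (hypothetical data)
X_demo = pd.DataFrame({"feat_1": [0, 2, 0], "feat_2": [3, 0, 0], "feat_3": [1, 1, 0]})

# Compute all statistics from the original columns first, so that a newly
# added column (e.g. the max) is not included in a later sum
stats = pd.DataFrame({
    "feat_max": X_demo.max(axis=1),
    "feat_sum": X_demo.sum(axis=1),
    "feat_nonzero_count": (X_demo != 0).sum(axis=1),
})
X_with_stats = pd.concat([X_demo, stats], axis=1)
print(X_with_stats)
```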

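## Data preprocessing
Because the data is extremely sparse, scaling should use MinMaxScaler so that the transformed data stays sparse (a count feature whose minimum is 0 keeps its zeros).
If the features are instead treated like term frequencies, skip the scaling and normalize each sample by its norm.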
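
Both options can be written out in a few lines of NumPy. This is a sketch on toy arrays (all names and data are hypothetical), not the code this notebook runs: min-max scaling computes `(x - min) / (max - min)` per column with the minimum and maximum taken from the training data, while the alternative divides each row by its L2 norm and needs no fitting.

```python
import numpy as np

# Toy count matrices (hypothetical data)
X_train_demo = np.array([[0, 0, 5], [2, 0, 0], [0, 8, 1]], dtype=float)
X_test_demo = np.array([[1, 0, 0], [0, 4, 10]], dtype=float)

# Min-max scaling with statistics taken from the training data
col_min = X_train_demo.min(axis=0)
col_range = X_train_demo.max(axis=0) - col_min
col_range[col_range == 0] = 1.0  # guard against constant columns
X_minmax = (X_test_demo - col_min) / col_range

# Alternative: scale each sample (row) to unit L2 norm, no fitting needed
row_norm = np.linalg.norm(X_test_demo, axis=1, keepdims=True)
row_norm[row_norm == 0] = 1.0  # leave all-zero rows unchanged
X_unit = X_test_demo / row_norm

print(X_minmax)
print(X_unit)
```

Every column minimum is 0 here, so the zeros of the test matrix are still zeros after min-max scaling, which is the sparsity-preserving property mentioned above.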

In [8]:
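# scale the raw features
#from sklearn.preprocessing import MinMaxScaler
# build the scaler for the input features
#ms_org = MinMaxScaler()

# pickle works on both Python 2 and 3; cPickle exists only on Python 2
import pickle
with open("MinMaxSclaer_org.pkl", 'rb') as f:
    ms_org = pickle.load(f)

# keep the feature names, used when saving the result as csv
feat_names_org = X_test.columns

# scale the test data with the scaler fitted on the training data: transform only
X_test = ms_org.transform(X_test)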

In [9]:
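# scale the log features
#from sklearn.preprocessing import MinMaxScaler
# build the scaler for the input features
#ms_log = MinMaxScaler()

# pickle works on both Python 2 and 3; cPickle exists only on Python 2
import pickle
with open("MinMaxSclaer_log.pkl", 'rb') as f:
    ms_log = pickle.load(f)

# keep the feature names, used when saving the result as csv
feat_names_log = X_test_log.columns

# scale the test data with the scaler fitted on the training data: transform only
X_test_log = ms_log.transform(X_test_log)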

In [10]:
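# scale the tf-idf features
#from sklearn.preprocessing import MinMaxScaler
# build the scaler for the input features
#ms_tfidf = MinMaxScaler()

# pickle works on both Python 2 and 3; cPickle exists only on Python 2
import pickle
with open("MinMaxSclaer_tfidf.pkl", 'rb') as f:
    ms_tfidf = pickle.load(f)

# keep the feature names, used when saving the result as csv
feat_names_tfidf = X_test_tfidf.columns

# scale the test data with the scaler fitted on the training data: transform only
X_test_tfidf = ms_tfidf.transform(X_test_tfidf)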

In [11]:
# save the scaled raw features
feat_names = columns_org
test_org = pd.concat([test_id, pd.DataFrame(columns = feat_names_org, data = X_test)], axis = 1)
test_org.to_csv('./data/'+'Otto_FE_test_org.csv',index=False,header=True)

In [12]:
# save the log-transformed features
test_log = pd.concat([test_id, pd.DataFrame(columns = feat_names_log, data = X_test_log)], axis = 1)
test_log.to_csv('./data/'+'Otto_FE_test_log.csv',index=False,header=True)

In [13]:
# save the tf-idf features
test_tfidf = pd.concat([test_id, pd.DataFrame(columns = feat_names_tfidf, data = X_test_tfidf)], axis = 1)
test_tfidf.to_csv('./data/'+'Otto_FE_test_tfidf.csv',index=False,header=True)