# Otto Product Classification: Logistic Regression, Testing

Using data from the Otto Group Product Classification Challenge, held on Kaggle in 2015, as an example, we call
LogisticRegression with default parameters and
LogisticRegression + GridSearchCV (LogisticRegressionCV can be used instead) for parameter tuning.
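
The two setups named above can be sketched on synthetic data as follows. This is a minimal illustration, not the training code behind this notebook: the `make_classification` data, the grid over `C`, the 3-fold split, and the raised `max_iter` are all assumptions chosen so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption: 3 classes, 20 features).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

# Baseline: default regularization; max_iter is raised only to avoid
# convergence warnings.
lr_default = LogisticRegression(max_iter=1000)
lr_default.fit(X, y)

# Tuned: grid search over the regularization strength C,
# scored with negative log loss so that larger is better.
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=3, scoring="neg_log_loss")
grid.fit(X, y)

best_C = grid.best_params_["C"]
best_cv_logloss = -grid.best_score_  # flip the sign back to a loss
print(best_C, best_cv_logloss)
```

`LogisticRegressionCV` performs a similar search over `C` internally, which is why it can stand in for the `GridSearchCV` wrapper.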

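The Otto dataset, provided by the well-known e-commerce company Otto, poses a multi-class product classification problem with 9 classes. Each sample has 93 numeric features (integers that count how often some event occurred; the features have been anonymized). Competition site: https://www.kaggle.com/c/otto-group-product-classification-challenge/data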

First place: https://www.kaggle.com/c/otto-group-product-classification-challenge/discussion/14335
Second place: http://blog.kaggle.com/2015/06/09/otto-product-classification-winners-interview-2nd-place-alexander-guschin/

In [1]:
# First, import the required modules
import pandas as pd 
import numpy as np

## Read the data

In [2]:
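# Read the data
# Try the log(x+1) and tf_idf features yourself and compare the results across feature encodings.
# Models trained on these different feature encodings can be combined by stacking.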
# path to where the data lies
dpath = './data/'
test1 = pd.read_csv(dpath +"Otto_FE_test_org.csv")
#test = pd.read_csv(dpath +"Otto_FE_test_log.csv")
test2 = pd.read_csv(dpath +"Otto_FE_test_tfidf.csv")

# Drop the redundant id column
test2 = test2.drop(["id"], axis=1)
test =  pd.concat([test1, test2], axis = 1, ignore_index=False)
test.head()

Unnamed: 0,id,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8,feat_9,...,feat_84_tfidf,feat_85_tfidf,feat_86_tfidf,feat_87_tfidf,feat_88_tfidf,feat_89_tfidf,feat_90_tfidf,feat_91_tfidf,feat_92_tfidf,feat_93_tfidf
0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.421803,0.052224,0.842245,0.0,0.0,0.0,0.0,0.0
1,2,0.032787,0.039216,0.21875,0.228571,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.143963,0.0,0.0,0.070171,0.0
2,3,0.0,0.019608,0.1875,0.014286,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.078248,0.0,0.0,0.0,0.0,0.071995
3,4,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,...,0.0,0.139311,0.034257,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,0.016393,0.0,0.0,0.014286,0.0,0.0,0.026316,0.026316,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.556178,0.0,0.0
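
The concatenation in the cell above relies on both files listing the same samples in the same order, because `pd.concat(..., axis=1)` aligns rows by index and not by `id`. A minimal sketch of a check for this, using small made-up frames (the toy values and column names are assumptions, not the real files):

```python
import pandas as pd

# Toy stand-ins for the two feature files (assumption: same rows, same order).
org = pd.DataFrame({"id": [1, 2, 3], "feat_1": [0.0, 0.5, 1.0]})
tfidf = pd.DataFrame({"id": [1, 2, 3], "feat_1_tfidf": [0.2, 0.0, 0.7]})

# concat(axis=1) aligns on the index, so verify that the ids match row by row.
assert (org["id"] == tfidf["id"]).all(), "row order differs between files"

combined = pd.concat([org, tfidf.drop(columns=["id"])], axis=1)

# Alternative that does not depend on row order: join on the id column.
merged = org.merge(tfidf, on="id", how="inner")
print(combined)
```

If the row order could differ between files, the `merge` variant is the safer choice.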


## Prepare the data

In [3]:
test_id = test['id']   
X_test = test.drop(["id"], axis=1)

# Save the feature names for later use (visualization)
feat_names = X_test.columns 

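# Most sklearn learners support sparse input, which makes training much faster
# To check whether a learner supports sparse data, see whether its fit function accepts X: {array-like, sparse matrix}.
# You can use timeit to compare training times on dense and sparse data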
from scipy.sparse import csr_matrix
X_test = csr_matrix(X_test)
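
The timing comparison suggested in the comments above could look like the sketch below. The data is synthetic (Poisson counts with mostly zeros, random labels), and the sizes, `max_iter`, and repeat count are assumptions; the measured times depend on the machine and on how sparse the data is, so no particular speedup is implied.

```python
import timeit

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Synthetic sparse count features (assumption: roughly 90% zeros, 93 columns).
X_dense = rng.poisson(0.1, size=(500, 93)).astype(float)
y = rng.randint(0, 9, size=500)
X_sparse = csr_matrix(X_dense)

def fit(X):
    # The same estimator accepts dense arrays and sparse matrices.
    return LogisticRegression(max_iter=200).fit(X, y)

# Total wall-clock time for 3 fits on each representation.
t_dense = timeit.timeit(lambda: fit(X_dense), number=3)
t_sparse = timeit.timeit(lambda: fit(X_sparse), number=3)
print("dense: %.3fs, sparse: %.3fs" % (t_dense, t_sparse))
```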

In [4]:
# Load the trained model
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

#lr_best = pickle.load(open("Otto_L1_org.pkl", 'rb'))
#lr_best = pickle.load(open("Otto_L2_log.pkl", 'rb'))
with open("Otto_L1_org_tfidf.pkl", 'rb') as f:
    lr_best = pickle.load(f)

# Output the probability of each class
y_test_pred = lr_best.predict_proba(X_test)
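
For readers without the `.pkl` files, the save, load, and predict steps can be reproduced on toy data. This is a sketch under assumptions: the model here is fitted on random features with three made-up classes and is serialized to an in-memory buffer instead of a file.

```python
import io
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = np.arange(60) % 3  # three balanced toy classes

model = LogisticRegression(max_iter=500).fit(X, y)

# Serialize to an in-memory buffer; a real run would write to a .pkl file.
buf = io.BytesIO()
pickle.dump(model, buf)
buf.seek(0)
restored = pickle.load(buf)

proba = restored.predict_proba(X)
# Columns follow restored.classes_, and every row sums to 1.
row_sums = proba.sum(axis=1)
print(restored.classes_, proba.shape)
```

A pickled scikit-learn model is generally only safe to load with the same library version that wrote it, so it helps to record the version alongside the file.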

In [5]:
y_test_pred.shape

(144368, 9)

In [6]:
# Build the submission file
out_df = pd.DataFrame(y_test_pred)

columns = np.empty(9, dtype=object)
for i in range(9):
    columns[i] = 'Class_' + str(i+1)

out_df.columns = columns

out_df = pd.concat([test_id,out_df], axis = 1)
out_df.to_csv("LR_org_tfidf.csv", index=False)
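
Before uploading, it can be worth checking the submission frame. The sketch below builds a frame of the same layout from made-up probabilities (the 4 rows and random values are assumptions) and verifies the column names, the absence of missing values, and that each row of probabilities sums to 1.

```python
import numpy as np
import pandas as pd

# Toy predictions standing in for y_test_pred (assumption: 4 rows, 9 classes).
rng = np.random.RandomState(0)
raw = rng.rand(4, 9)
proba = raw / raw.sum(axis=1, keepdims=True)  # normalize rows to sum to 1
ids = pd.Series([1, 2, 3, 4], name="id")

class_cols = ["Class_%d" % (i + 1) for i in range(9)]
sub = pd.concat([ids, pd.DataFrame(proba, columns=class_cols)], axis=1)

# Sanity checks before writing the CSV.
assert list(sub.columns) == ["id"] + class_cols
assert not sub.isnull().values.any()
assert np.allclose(sub[class_cols].sum(axis=1), 1.0)
print(sub.shape)
```

Naming the columns `Class_1` through `Class_9` by position assumes the model's `classes_` attribute is in that same order, which is worth confirming on the loaded model.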

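Original feature encoding: Private Leaderboard score on Kaggle of 0.66683, close to the cross-validation estimate (0.67284430278), which suggests that cross-validation gives a good estimate of the test error.
log feature encoding: Private Leaderboard score on Kaggle of 0.67317
tfidf features: Private Leaderboard score on Kaggle of 0.63319

Concatenating the original and tfidf feature encodings is also worth trying: Private Leaderboard score on Kaggle of 0.59817 (rank 2243).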
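
The competition is scored by multi-class logarithmic loss, so lower scores are better. A minimal sketch of how that quantity is computed, using made-up labels and probabilities (the toy arrays are assumptions, not results from this notebook):

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels and predicted probabilities (assumption: 3 classes, 4 samples).
y_true = np.array([0, 1, 2, 1])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.3, 0.4, 0.3]])

# Manual computation: mean negative log-probability of the true class.
manual = -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))

# Same quantity via scikit-learn.
sk = log_loss(y_true, proba, labels=[0, 1, 2])
print(manual, sk)
```

Because the loss is the negative log of the probability assigned to the true class, confident wrong predictions are penalized heavily, which is why well-calibrated probabilities matter more here than raw accuracy.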
