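* Write a function that converts a multi-class variable into several binary dummy variables, without using the sklearn library.  
* Write a function that implements cross-validation, without using the sklearn library.  
  Given one dataset, split it into two groups: the input is one large matrix and the output is two smaller matrices.
* Use other classification methods from the sklearn library to predict survival on the Titanic.  
  See how well the other classification methods perform.  
* Study the Digit Recognizer data on Kaggle: try some feature engineering to extract features of the digits, feed them to a classifier, and check whether prediction accuracy improves over using the raw variables directly.  
  Recognize the digit in each image in two ways: one feeds the raw data straight in, the other uses feature engineering.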

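# 1. Converting a multi-class variable into binary dummy variables

## 1.1 Basic idea  

We use the one-vs-all approach: mark one of the classes as the positive class ($y=1$) and all other classes as the negative class, and call the resulting model $h^{(1)}_\theta(x)$.  
Next, in the same way, pick another class as the positive class ($y=2$), mark the remaining classes as negative, and call that model $h^{(2)}_\theta(x)$, and so on.  

This yields a family of models $h^{(i)}_\theta(x)=p(y=i\mid x;\theta)$, where $i=1,2,\dots,k$.  
To predict, run every classifier on the input and choose the class whose classifier reports the highest probability.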
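The prediction rule above can be sketched as follows. This is a minimal illustration, not the notebook's own code: `predict_one_vs_all` is a hypothetical helper, the probabilities are made up, and classes are numbered from 0 here rather than from 1.

```python
import numpy as np

def predict_one_vs_all(probabilities):
    """Pick, for each row, the class whose classifier gives the highest probability.

    probabilities: array of shape (n_samples, k); column i holds the
    (hypothetical) output of the i-th one-vs-all classifier.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    return probabilities.argmax(axis=1)

# Made-up probabilities for 3 samples and k = 3 classes.
probs = np.array([[0.9, 0.2, 0.1],
                  [0.1, 0.3, 0.7],
                  [0.2, 0.6, 0.5]])
predicted = predict_one_vs_all(probs)
print(predicted)  # [0 2 1]
```

The per-class probabilities do not need to sum to 1, because each column comes from a separately fitted binary model; only their ranking within a row matters.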

The approach for this task is similar:  
we convert the $n$-class data into $n$ datasets and fit each one with `smf.logit`.  

## 1.2 Python implementation

In [1]:
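def one_vs_all(dataset):
    """Build one target column per class for one-vs-all fitting."""
    import numpy as np

    X = dataset.data
    target = np.asarray(dataset.target)
    n = dataset.target_names.shape[0]
    targets = np.zeros((len(target), n))

    # Column i-1 holds +i for samples of class i-1 and -i for every other sample.
    # np.where replaces the original map(), which returns an iterator on Python 3.
    for i in range(1, n + 1):
        targets[:, i - 1] = np.where(target == i - 1, i, -i)
    return X, targets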

In [2]:
from sklearn import datasets
dataset = datasets.load_iris()

In [3]:
X,y = one_vs_all(dataset)

In [4]:
X.shape

(150, 4)

In [5]:
y.shape

(150, 3)
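The function above encodes each column as `+i` / `-i`. If plain 0/1 dummy variables are wanted instead (the form a logit fit usually expects), a sketch without sklearn could look like this. `to_dummies` is a hypothetical helper name and the port codes are only sample input.

```python
import numpy as np

def to_dummies(values):
    """Return (categories, matrix) where matrix[r, c] is 1 if values[r] == categories[c]."""
    values = np.asarray(values)
    categories = np.unique(values)          # sorted distinct categories
    # Broadcasting compares every value against every category at once.
    matrix = (values[:, None] == categories[None, :]).astype(int)
    return categories, matrix

cats, dummies = to_dummies(["S", "C", "Q", "S"])
print(cats)      # ['C' 'Q' 'S']
print(dummies)   # one row per sample, one 0/1 column per category
```

Each row contains exactly one 1, so dropping any single column gives the usual reference-category encoding for regression.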

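# 2. Cross-validation 

Split the data into a training set and a test set, with the split ratio chosen by the user: the input is one matrix and the output is the two matrices after the split. Strictly speaking this is a hold-out split rather than k-fold cross-validation, and the rows are not shuffled before splitting.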

In [12]:
def my_cross_validation(data, target, train_size): # train_size: fraction of rows used for training

    n = int(round(train_size*len(target)))
    X_train = data[0:n,:]
    X_test = data[n:,:]
    y_train = target[0:n]
    y_test = target[n:]
    
    return X_train, X_test, y_train, y_test

In [13]:
X_train, X_test, y_train, y_test = my_cross_validation(dataset.data,dataset.target,0.68)

In [14]:
X_train.shape, X_test.shape

((102, 4), (48, 4))

In [15]:
y_train.shape, y_test.shape

((102,), (48,))
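For comparison, a shuffled k-fold split can also be written with numpy alone. The sketch below is an assumption about how one might extend the function above, not part of the original assignment; `k_fold_indices` and its `seed` parameter are hypothetical names.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled folds."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)      # shuffle so class-sorted data gets mixed
    folds = np.array_split(order, k)        # k nearly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Toy data: 10 rows, 2 columns.
data = np.arange(20).reshape(10, 2)
target = np.arange(10)
splits = list(k_fold_indices(len(target), k=5))
for train_idx, test_idx in splits:
    X_tr, X_te = data[train_idx], data[test_idx]
    y_tr, y_te = target[train_idx], target[test_idx]
    print(X_tr.shape, X_te.shape)   # (8, 2) (2, 2) for each fold
```

Shuffling matters for data such as iris, where the rows are ordered by class and an unshuffled split can leave a class out of the training rows entirely.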

# 3. Titanic 

In [21]:
import numpy as np  # needed by the np.vstack calls in section 3.3
import pandas as pd

## 3.1 Loading and previewing the data

In [22]:
train = pd.read_csv("~/DataScience/OMOOC.Data/assignment/7w/train.csv", delimiter=",", encoding=None, header=0)

In [23]:
test = pd.read_csv("~/DataScience/OMOOC.Data/assignment/7w/test.csv", delimiter=",", encoding=None, header=0)

In [24]:
gendermodel = pd.read_csv("~/DataScience/OMOOC.Data/assignment/7w/gendermodel.csv", delimiter=",", encoding=None, header=0)

In [25]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [26]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [27]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [28]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


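## 3.2 Data preprocessing  

#### Principles:  
* Keep as many variables as possible; do not drop them arbitrarily.  
* Drop variables with many missing values, and variables that clearly have little to do with survival.  

#### Steps:  
* Drop id and name, which have little bearing on the outcome (in the code below, PassengerId stays in the frame for the submission file but is left out of the feature matrix). Drop ticket, which is hard to standardize; the fare can stand in for it.  
* Cabin has too many missing values, so Embarked is used instead, since Embarked together with the fare can to some extent indicate the cabin.
* Drop rows with no age or no Embarked value.  
* Encode sex as 0 and 1.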

In [29]:
train_data = train.drop(train.columns[[3,8,10]],axis=1)

In [30]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,male,22.0,1,0,7.25,S
1,2,1,1,female,38.0,1,0,71.2833,C
2,3,1,3,female,26.0,0,0,7.925,S
3,4,1,1,female,35.0,1,0,53.1,S
4,5,0,3,male,35.0,0,0,8.05,S


In [31]:
test_data = test.drop(test.columns[[2,7,9]],axis=1)

In [32]:
test_data.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,892,3,male,34.5,0,0,7.8292,Q
1,893,3,female,47.0,1,0,7.0,S
2,894,2,male,62.0,0,0,9.6875,Q
3,895,3,male,27.0,0,0,8.6625,S
4,896,3,female,22.0,1,1,12.2875,S


In [33]:
train_data.Sex = train_data.Sex.map({"male":1, "female":0})
test_data.Sex = test_data.Sex.map({"male":1, "female":0})
#train_data["Sex"] = train_data["Sex"].apply(lambda x: x == "male",1,0)
#test_data["Sex"] = test_data["Sex"].apply(lambda x: x == "male",1,0)

In [34]:
train_data.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [35]:
test_data.Embarked.value_counts()

S    270
C    102
Q     46
Name: Embarked, dtype: int64

In [36]:
train_data.Embarked = train_data.Embarked.map({"S":0, "C":1, "Q":2})
test_data.Embarked = test_data.Embarked.map({"S":0, "C":1, "Q":2})

In [37]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null int64
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       889 non-null float64
dtypes: float64(3), int64(6)
memory usage: 62.7 KB


In [38]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 8 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Sex            418 non-null int64
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Fare           417 non-null float64
Embarked       418 non-null int64
dtypes: float64(2), int64(6)
memory usage: 26.2 KB


In [39]:
train_dataset = train_data.dropna(axis=0)

In [40]:
train_dataset.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,1,22.0,1,0,7.25,0.0
1,2,1,1,0,38.0,1,0,71.2833,1.0
2,3,1,3,0,26.0,0,0,7.925,0.0
3,4,1,1,0,35.0,1,0,53.1,0.0
4,5,0,3,1,35.0,0,0,8.05,0.0


In [41]:
train_dataset.shape

(712, 9)

In [42]:
test_data.shape

(418, 8)

In [43]:
gendermodel.shape

(418, 2)

In [44]:
test_data["Survived"] = gendermodel["Survived"]

In [45]:
test_data.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Survived
0,892,3,1,34.5,0,0,7.8292,2,0
1,893,3,0,47.0,1,0,7.0,0,1
2,894,2,1,62.0,0,0,9.6875,2,0
3,895,3,1,27.0,0,0,8.6625,0,0
4,896,3,0,22.0,1,1,12.2875,0,1


In [46]:
test_dataset = test_data.dropna(axis=0)

In [47]:
test_dataset.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Survived
0,892,3,1,34.5,0,0,7.8292,2,0
1,893,3,0,47.0,1,0,7.0,0,1
2,894,2,1,62.0,0,0,9.6875,2,0
3,895,3,1,27.0,0,0,8.6625,0,0
4,896,3,0,22.0,1,1,12.2875,0,1


In [48]:
test_dataset.shape

(331, 9)

## 3.3 构造数据  

All data below have had rows containing NA removed.

In [49]:
X_train = np.vstack([train_dataset["Pclass"],train_dataset["Sex"],train_dataset["Age"],\
                     train_dataset["SibSp"],train_dataset["Parch"],train_dataset["Fare"],\
                    train_dataset["Embarked"]]).T

In [50]:
X_train.shape

(712, 7)

In [51]:
y_train = np.vstack([train_dataset["Survived"]]).T

In [52]:
y_train.shape

(712, 1)

In [53]:
X_test = np.vstack([test_dataset["Pclass"],test_dataset["Sex"],test_dataset["Age"],\
                     test_dataset["SibSp"],test_dataset["Parch"],test_dataset["Fare"],\
                    test_dataset["Embarked"]]).T

In [54]:
y_test = np.vstack([test_dataset["Survived"]]).T  # labels taken from the site's gendermodel.csv (a sample submission, not ground truth)

## 3.4 预测

In [55]:
from sklearn import datasets
from sklearn import model_selection  # cross_validation was removed in scikit-learn 0.20
from sklearn import linear_model
from sklearn import metrics
from sklearn import tree
from sklearn import neighbors
from sklearn import svm
from sklearn import ensemble
from sklearn import cluster

%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
import seaborn as sns

### 3.4.1 逻辑回归

In [56]:
classifier = linear_model.LogisticRegression()
classifier.fit(X_train, y_train.ravel())  # ravel: fit expects a 1-D label array
y_test_pred = classifier.predict(X_test)

print(metrics.confusion_matrix(y_test, y_test_pred))
print(metrics.classification_report(y_test,y_test_pred))

[[192  12]
 [  7 120]]
             precision    recall  f1-score   support

          0       0.96      0.94      0.95       204
          1       0.91      0.94      0.93       127

avg / total       0.94      0.94      0.94       331
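
The figures in the report can be reproduced by hand from the confusion matrix. The sketch below only redoes that arithmetic with numpy; it relies on the sklearn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = np.array([[192, 12],
               [7, 120]])

precision = np.diag(cm) / cm.sum(axis=0)   # correct predictions / all predictions of that class
recall = np.diag(cm) / cm.sum(axis=1)      # correct predictions / all true members of that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()

print(np.round(precision, 2))  # [0.96 0.91]
print(np.round(recall, 2))     # [0.94 0.94]
print(np.round(f1, 2))         # [0.95 0.93]
print(round(accuracy, 2))      # 0.94
```

For class 1, for example, precision is 120 / (12 + 120) and recall is 120 / (7 + 120).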



In [57]:
results = np.array([test_dataset["PassengerId"],y_test_pred]).T

In [58]:
results.shape

(331, 2)

In [59]:
compared_results = pd.Series(results[:,1],results[:,0])  

In [60]:
compared_results.to_csv("./logitregres.csv")

Because rows containing NA were dropped above, predictions are missing for those passengers, and the Kaggle submission failed.

In [61]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 9 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Sex            418 non-null int64
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Fare           417 non-null float64
Embarked       418 non-null int64
Survived       418 non-null int64
dtypes: float64(2), int64(7)
memory usage: 29.5 KB


**Fill the NA values in Age and Fare with the column mean.**

In [62]:
test_data['Age'].mean()

30.272590361445783

In [63]:
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].mean())  # same value as the mean printed above

In [64]:
test_data['Fare'].mean()

35.6271884892086

In [65]:
test_data['Fare']=test_data['Fare'].fillna(test_data['Fare'].mean());

In [66]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 9 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Sex            418 non-null int64
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Fare           418 non-null float64
Embarked       418 non-null int64
Survived       418 non-null int64
dtypes: float64(2), int64(7)
memory usage: 29.5 KB


#### Rebuilding the test set

In [67]:
X_test2 = np.vstack([test_data["Pclass"],test_data["Sex"],test_data["Age"],\
                     test_data["SibSp"],test_data["Parch"],test_data["Fare"],\
                    test_data["Embarked"]]).T

In [68]:
X_test2.shape

(418, 7)

In [69]:
classifier = linear_model.LogisticRegression()
classifier.fit(X_train, y_train.ravel())  # ravel: fit expects a 1-D label array
y_test_pred = classifier.predict(X_test2)

results = np.array([test_data["PassengerId"],y_test_pred]).T
compared_results = pd.Series(results[:,1],results[:,0])
compared_results.to_csv("./logitregres2.csv")


**Submission score: 74%**
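
The submission file is written from a pandas Series, which carries no column names. A sketch of writing it with an explicit header row is shown below, on the assumption that the Titanic competition expects the columns `PassengerId` and `Survived`. The ids and predictions are made up, and an in-memory buffer stands in for the output path.

```python
import io
import numpy as np
import pandas as pd

passenger_ids = np.array([892, 893, 894])   # made-up ids
predictions = np.array([0.0, 1.0, 0.0])     # made-up classifier output

submission = pd.DataFrame(
    {"PassengerId": passenger_ids,
     "Survived": predictions.astype(int)},  # integer labels, not 0.0 / 1.0
    columns=["PassengerId", "Survived"],
)

buffer = io.StringIO()                      # replace with a file path in practice
submission.to_csv(buffer, index=False)      # index=False: no extra index column
csv_text = buffer.getvalue()
print(csv_text)
```

Casting to `int` avoids writing labels such as `1.0`, which can appear when ids and predictions are stacked into a single float array.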

### 3.4.2 DecisionTree

In [434]:
classifier = tree.DecisionTreeClassifier()
classifier.fit(X_train, y_train.ravel())  # ravel: fit expects a 1-D label array
y_test_pred = classifier.predict(X_test2)

results = np.array([test_data["PassengerId"],y_test_pred]).T
compared_results = pd.Series(results[:,1],results[:,0])  
compared_results.to_csv("./decisiontree.csv")
#print(metrics.confusion_matrix(y_test, y_test_pred))
#print(metrics.classification_report(y_test,y_test_pred))

**Submission score: 67%**

### 3.4.3 KNN

In [438]:
classifier = neighbors.KNeighborsClassifier()
classifier.fit(X_train, y_train.ravel())  # ravel: fit expects a 1-D label array
y_test_pred = classifier.predict(X_test2)

results = np.array([test_data["PassengerId"],y_test_pred]).T
compared_results = pd.Series(results[:,1],results[:,0])
compared_results.to_csv("./knn.csv")


**Submission score: 64%**

### 3.4.4 SVM

In [439]:
classifier = svm.SVC()
classifier.fit(X_train, y_train.ravel())  # ravel: fit expects a 1-D label array
y_test_pred = classifier.predict(X_test2)

results = np.array([test_data["PassengerId"],y_test_pred]).T
compared_results = pd.Series(results[:,1],results[:,0])
compared_results.to_csv("./svm.csv")


**Submission score: 59%**

### 3.4.5 Random forest

In [440]:
classifier = ensemble.RandomForestClassifier()
classifier.fit(X_train, y_train.ravel())  # ravel: fit expects a 1-D label array
y_test_pred = classifier.predict(X_test2)

results = np.array([test_data["PassengerId"],y_test_pred]).T
compared_results = pd.Series(results[:,1],results[:,0])
compared_results.to_csv("./randomforest.csv")


**Submission score: 73%**

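## 3.5 Summary

Logistic regression gave the best submission score (74%), just ahead of the random forest (73%). 

**Possible improvements:**
* Rows with NA were dropped from the training set, which may hurt the model overall; filling the missing Age values with the mean is worth considering.  
* The port encoding should probably reflect the distances between ports by assigning different values, which may work better.  
* The data were not standardized; standardizing would likely improve the results, especially for distance- and margin-based methods such as KNN and SVM.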
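
The first two ideas above could look roughly like this. It is a sketch on made-up rows, not the notebook's data. It uses one-hot columns for the port, which is a different technique from the distance-based values suggested above and avoids implying any order among S, C and Q. `fill_missing` and `one_hot_port` are hypothetical helpers, and the fill values are computed from the training frame only.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the Titanic frames.
train_df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 30.0],
                         "Embarked": ["S", "C", "S", None]})
test_df = pd.DataFrame({"Age": [np.nan, 47.0],
                        "Embarked": ["Q", "S"]})

age_mean = train_df["Age"].mean()                   # statistics come from training data only
most_common_port = train_df["Embarked"].mode()[0]   # "S" for these rows
ports = ["C", "Q", "S"]                             # fixed list so both frames get the same columns

def fill_missing(df):
    """Fill Age with the training mean and Embarked with the most common training port."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(age_mean)
    out["Embarked"] = out["Embarked"].fillna(most_common_port)
    return out

def one_hot_port(df):
    """Replace Embarked with one 0/1 column per port."""
    out = df.copy()
    for port in ports:
        out["Embarked_" + port] = (out["Embarked"] == port).astype(int)
    return out.drop(columns="Embarked")

train_encoded = one_hot_port(fill_missing(train_df))
test_encoded = one_hot_port(fill_missing(test_df))
print(train_encoded)
print(test_encoded)
```

With this approach no training rows are lost, so the model sees all 891 passengers instead of 712.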

# 4. Digit Recognizer

## 4.1 Loading and previewing the data

In [4]:
train = pd.read_csv("~/下载/train.csv", delimiter=",", encoding=None, header=0)

In [5]:
test = pd.read_csv("~/下载/test.csv", delimiter=",", encoding=None, header=0)

In [450]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42000 entries, 0 to 41999
Columns: 785 entries, label to pixel783
dtypes: int64(785)
memory usage: 251.5 MB


In [453]:
train.shape

(42000, 785)

In [455]:
test.shape

(28000, 784)

In [457]:
test.describe()

Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
count,28000.0,28000.0,28000.0,28000.0,28000.0,28000.0,28000.0,28000.0,28000.0,28000.0,...,28000.0,28000.0,28000.0,28000.0,28000.0,28000.0,28000.0,28000.0,28000.0,28000.0
mean,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.164607,0.073214,0.028036,0.01125,0.006536,0.0,0.0,0.0,0.0,0.0
std,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.473293,3.616811,1.813602,1.205211,0.807475,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,253.0,254.0,193.0,187.0,119.0,0.0,0.0,0.0,0.0,0.0


In [458]:
train

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 4.2 Building the data

In [6]:
X_train = np.array(train)[:,1:]

In [7]:
X_train.shape

(42000, 784)

In [8]:
y_train = np.array(train)[:,0]

In [9]:
y_train.shape

(42000,)

In [11]:
X_test = np.array(test)

In [139]:
X_test.shape

(28000, 784)

## 4.3 Direct prediction

In [476]:
train.index.shape

(42000,)

### 4.3.1 LogisticRegression

In [None]:
classifier = linear_model.LogisticRegression()
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)

In [519]:
results = np.array([np.array(test.index+1),y_test_pred]).T
compared_results = pd.Series(results[:,1],results[:,0])  
compared_results.to_csv("./Digit_Recognizer_Output/logitregres.csv")

**Submission score: 91%**

### 4.3.2 SVM

In [183]:
X_train.shape

(42000, 784)

In [184]:
y_train.shape

(42000,)

In [185]:
X_test.shape

(28000, 784)

In [186]:
classifier = svm.SVC()
classifier.fit(X_train, y_train)
y_test_pred_svm = classifier.predict(X_test)

In [187]:
results = np.array([np.array(test.index+1),y_test_pred_svm]).T
compared_results = pd.Series(results[:,1],results[:,0])  
compared_results.to_csv("./Digit_Recognizer_Output/svm.csv")

**Submission score: not tested, because of the site's submission limit.**

### 4.3.3 KNN

In [188]:
classifier = neighbors.KNeighborsClassifier()
classifier.fit(X_train, y_train)
y_test_pred_knn = classifier.predict(X_test)

In [189]:
results = np.array([np.array(test.index+1),y_test_pred_knn]).T
compared_results = pd.Series(results[:,1],results[:,0])  
compared_results.to_csv("./Digit_Recognizer_Output/knn.csv")

**Submission score: 97%**

### 4.3.4 RandomForest

In [190]:
classifier = ensemble.RandomForestClassifier()
classifier.fit(X_train, y_train)
y_test_pred_rf = classifier.predict(X_test)

In [191]:
results = np.array([np.array(test.index+1),y_test_pred_rf]).T
compared_results = pd.Series(results[:,1],results[:,0])  
compared_results.to_csv("./Digit_Recognizer_Output/randomforest.csv")

**Submission score: 94%**

### 4.3.5 DecisionTree

In [192]:
classifier = tree.DecisionTreeClassifier()
classifier.fit(X_train, y_train)
y_test_pred_dt = classifier.predict(X_test)

In [193]:
results = np.array([np.array(test.index+1),y_test_pred_dt]).T
compared_results = pd.Series(results[:,1],results[:,0])  
compared_results.to_csv("./Digit_Recognizer_Output/decisiontree.csv")

**Submission score: 86%**

## 4.4 Feature engineering

In [12]:
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X_train, y_train)

print(model.feature_importances_);

[  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   4.44010189e-06
   3.22373049e-06   7.43616072e-06   3.70488430e-06   1.84538741e-05
   9.96875764e-06   1.45656591e-05   0.00000000e+00   4.37849962e-06
   4.39549765e-06   3.26382664e-06   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00

In [13]:
features = model.feature_importances_

In [14]:
features.shape

(784,)

In [15]:
features.mean()

0.0012755102040816326

In [16]:
features = pd.DataFrame(np.array([features]).T, index=test.columns, columns=['importance'])

In [17]:
features.head()

Unnamed: 0,importance
pixel0,0.0
pixel1,0.0
pixel2,0.0
pixel3,0.0
pixel4,0.0


In [18]:
features["importance"].sort_values(axis=0, ascending=False).head(78)

pixel378    0.018814
pixel489    0.014785
pixel542    0.011267
pixel154    0.009936
pixel458    0.009027
pixel291    0.008959
pixel462    0.008450
pixel433    0.008221
pixel381    0.008114
pixel403    0.007899
pixel567    0.007560
pixel409    0.007137
pixel570    0.006642
pixel182    0.006625
pixel346    0.006567
pixel437    0.006453
pixel295    0.006159
pixel465    0.006077
pixel155    0.006031
pixel211    0.006006
pixel464    0.005987
pixel351    0.005953
pixel377    0.005925
pixel461    0.005746
pixel515    0.005738
pixel490    0.005635
pixel406    0.005577
pixel382    0.005503
pixel373    0.005387
pixel457    0.005371
              ...   
pixel460    0.004826
pixel319    0.004794
pixel153    0.004784
pixel655    0.004782
pixel374    0.004781
pixel539    0.004753
pixel597    0.004731
pixel210    0.004718
pixel518    0.004672
pixel595    0.004666
pixel513    0.004644
pixel488    0.004615
pixel404    0.004603
pixel324    0.004556
pixel322    0.004536
pixel237    0.004432
pixel658    0

In [19]:
index_columns = features["importance"].sort_values(axis=0, ascending=False).head(78).index # keep the top tenth (78 of 784 pixels)

In [20]:
X_train_features = np.array(train[index_columns])[:]

In [25]:
train.index

RangeIndex(start=0, stop=42000, step=1)

In [21]:
index_columns

Index([u'pixel378', u'pixel489', u'pixel542', u'pixel154', u'pixel458',
       u'pixel291', u'pixel462', u'pixel433', u'pixel381', u'pixel403',
       u'pixel567', u'pixel409', u'pixel570', u'pixel182', u'pixel346',
       u'pixel437', u'pixel295', u'pixel465', u'pixel155', u'pixel211',
       u'pixel464', u'pixel351', u'pixel377', u'pixel461', u'pixel515',
       u'pixel490', u'pixel406', u'pixel382', u'pixel373', u'pixel457',
       u'pixel512', u'pixel432', u'pixel486', u'pixel402', u'pixel400',
       u'pixel484', u'pixel350', u'pixel401', u'pixel543', u'pixel238',
       u'pixel347', u'pixel511', u'pixel239', u'pixel266', u'pixel405',
       u'pixel206', u'pixel358', u'pixel656', u'pixel460', u'pixel319',
       u'pixel153', u'pixel655', u'pixel374', u'pixel539', u'pixel597',
       u'pixel210', u'pixel518', u'pixel595', u'pixel513', u'pixel488',
       u'pixel404', u'pixel324', u'pixel322', u'pixel237', u'pixel658',
       u'pixel318', u'pixel544', u'pixel399', u'pixel212', u'pix

In [143]:
X_train_features.shape

(42000, 78)

In [146]:
y_train.shape

(42000,)

In [147]:
test.shape

(28000, 784)

In [148]:
X_test_features = np.array(test[index_columns])[:]

In [151]:
X_test_features.shape

(28000, 78)

### Standardization

In [155]:
from sklearn import preprocessing

standardized_X_train_features = preprocessing.scale(X_train_features)
standardized_X_test_features = preprocessing.scale(X_test_features)




In [154]:
standardized_X_train_features.shape

(42000, 78)

In [156]:
standardized_X_test_features.shape

(28000, 78)
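
The cell above scales the training and test features separately, each with its own mean and standard deviation. A common alternative, sketched here with numpy on made-up numbers, computes the statistics on the training set only and reuses them for the test set, so both sets go through the same transformation. `fit_standardizer` and `apply_standardizer` are hypothetical helper names.

```python
import numpy as np

def fit_standardizer(X_train):
    """Return per-column mean and standard deviation of the training data."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0          # constant columns: avoid division by zero
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale any dataset with statistics learned from the training data."""
    return (np.asarray(X, dtype=float) - mean) / std

# Made-up data; the second column is constant in the training set.
X_tr = np.array([[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]])
X_te = np.array([[2.0, 10.0], [6.0, 12.0]])

mean, std = fit_standardizer(X_tr)
Z_tr = apply_standardizer(X_tr, mean, std)
Z_te = apply_standardizer(X_te, mean, std)
print(Z_tr.mean(axis=0))   # approximately [0. 0.]
print(Z_te)
```

The guard for zero standard deviation is relevant to pixel data, where border pixels are 0 in every image.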

## Another way to select features

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)  # keyword form; newer scikit-learn rejects the positional argument
rfe = rfe.fit(X_train, y_train)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)

## 4.5 Predicting again

In [161]:
from sklearn.naive_bayes import GaussianNB

classifiers = [tree.DecisionTreeClassifier,
               neighbors.KNeighborsClassifier,
               svm.SVC,
               GaussianNB,
               ensemble.RandomForestClassifier,
               linear_model.LogisticRegression]

In [173]:
y_test_pred = np.zeros([len(standardized_X_test_features), len(classifiers)])

for i,Classifier in enumerate(classifiers):
    classifier = Classifier()
    classifier.fit(standardized_X_train_features, y_train)
    y_test_pred[:,i] = classifier.predict(standardized_X_test_features)

In [175]:
y_test_pred.shape

(28000, 6)

In [176]:
results = np.array([np.array(test.index+1),y_test_pred[:,0]]).T
compared_results = pd.Series(results[:,1],results[:,0])
compared_results.to_csv("./Digit_Recognizer_Output2/decisiontree.csv")

In [177]:
results = np.array([np.array(test.index+1),y_test_pred[:,1]]).T
compared_results = pd.Series(results[:,1],results[:,0])
compared_results.to_csv("./Digit_Recognizer_Output2/knn.csv")

In [178]:
results = np.array([np.array(test.index+1),y_test_pred[:,2]]).T
compared_results = pd.Series(results[:,1],results[:,0])
compared_results.to_csv("./Digit_Recognizer_Output2/svm.csv")

In [179]:
results = np.array([np.array(test.index+1),y_test_pred[:,3]]).T
compared_results = pd.Series(results[:,1],results[:,0])
compared_results.to_csv("./Digit_Recognizer_Output2/bayes.csv")

In [180]:
results = np.array([np.array(test.index+1),y_test_pred[:,4]]).T
compared_results = pd.Series(results[:,1],results[:,0])
compared_results.to_csv("./Digit_Recognizer_Output2/randomforest.csv")

In [181]:
results = np.array([np.array(test.index+1),y_test_pred[:,5]]).T
compared_results = pd.Series(results[:,1],results[:,0])
compared_results.to_csv("./Digit_Recognizer_Output2/logisregression.csv")

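### Submission scores:

* Bayes: 79%  
* DecisionTree: 50%
* KNN: 95%  
* LogisticRegression: 85%  
* RandomForest: 90%  
* SVM: 1.5% 

Note: the last two methods exceeded the site's submission limit and could not be tested for now; their results will be added after testing.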

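## 4.6 Summary  

* Feature selection can greatly speed up computation; when the dataset is very large, it is best to select features first.  
* Choose the cutoff with care: here the top 1/10 of the features were kept, and even the least important of them scores well above the mean importance, so some useful features may have been discarded, which could affect the final result.  
* Standardize the data before fitting.  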
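
One way to act on the second point is to keep every feature whose importance exceeds the mean, instead of a fixed top tenth. The sketch below uses made-up importances; `select_above_mean` is a hypothetical helper, and the mean is only one possible threshold.

```python
import numpy as np

def select_above_mean(importances):
    """Return indices of features with above-mean importance, most important first."""
    importances = np.asarray(importances, dtype=float)
    keep = np.flatnonzero(importances > importances.mean())
    # Order the kept indices from most to least important.
    return keep[np.argsort(importances[keep])[::-1]]

made_up = np.array([0.0, 0.05, 0.4, 0.0, 0.3, 0.25])   # sums to 1, mean is 1/6
selected = select_above_mean(made_up)
print(selected)   # [2 4 5]
```

With the notebook's data, the returned indices could be used to pick columns from `X_train` and `X_test` in the same way as `index_columns`.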


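Note: for direct prediction, every algorithm except logistic regression is still running (they are very slow); the results will be updated once the runs finish.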