# Applying a decision tree: predicting whether a customer buys a computer from age, income, student status, and credit rating

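+ Official scikit-learn documentation for decision trees: http://scikit-learn.org/stable/modules/tree.html

+ Each record is a set of key-value pairs, i.e. a dict.
+ Feature processing: use DictVectorizer to convert the dicts into a one-hot encoding.
+ Label processing: LabelBinarizer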
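
To make the feature-processing step concrete, here is a minimal pure-Python sketch of what one-hot encoding a list of dicts amounts to. It is not how `DictVectorizer` is implemented; the helper name `one_hot_encode` is hypothetical, and the sorted `key=value` column naming is an assumption chosen to mirror the feature names printed later in this notebook.

```python
def one_hot_encode(records):
    """One-hot encode a list of dicts with string values (illustrative helper)."""
    # Collect every distinct "key=value" pair; sorting gives a stable column order.
    columns = sorted({f"{k}={v}" for rec in records for k, v in rec.items()})
    index = {name: i for i, name in enumerate(columns)}
    rows = []
    for rec in records:
        row = [0.0] * len(columns)
        for k, v in rec.items():
            row[index[f"{k}={v}"]] = 1.0  # switch on the column for this value
        rows.append(row)
    return columns, rows


records = [
    {"age": "youth", "student": "no"},
    {"age": "senior", "student": "yes"},
]
columns, rows = one_hot_encode(records)
print(columns)
print(rows)
```

Each categorical value gets its own column, so a feature with three possible values occupies three columns, exactly one of which is 1 in any row.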

Let's first look at the data file.

In [1]:
cat ./AllElectronics.csv

RID,age,income,student,credit_rating,class_buys_computer
1,youth,high,no,fair,no
2,youth,high,no,excellent,no
3,middle_aged,high,no,fair,yes
4,senior,medium,no,fair,yes
5,senior,low,yes,fair,yes
6,senior,low,yes,excellent,no
7,middle_aged,low,yes,excellent,yes
8,youth,medium,no,fair,no
9,youth,low,yes,fair,yes
10,senior,medium,yes,fair,yes
11,youth,medium,yes,excellent,yes
12,middle_aged,medium,no,excellent,yes
13,middle_aged,high,yes,fair,yes
14,senior,medium,no,excellent,no


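Import the required modules (`sklearn.feature_extraction` provides the feature-extraction utilities) and read the header row.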

In [2]:
from sklearn.feature_extraction import DictVectorizer
import csv
from sklearn import tree
from sklearn import preprocessing
from io import StringIO  # sklearn.externals.six has been removed from newer scikit-learn releases

fr = open('./AllElectronics.csv')
reader = csv.reader(fr)
headers = next(reader)
headers

['RID', 'age', 'income', 'student', 'credit_rating', 'class_buys_computer']

+ `csv.reader` returns an iterator, so we call `next()` to fetch a row; the first call returns the header row.
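
The iterator behaviour can be checked in isolation. The sketch below swaps the real file for an in-memory `io.StringIO` holding the header and the first two data rows (an assumption made only so that the snippet runs on its own).

```python
import csv
import io

# A small stand-in for the CSV file (lines copied from the data above).
sample = io.StringIO(
    "RID,age,income,student,credit_rating,class_buys_computer\n"
    "1,youth,high,no,fair,no\n"
    "2,youth,high,no,excellent,no\n"
)
reader = csv.reader(sample)

header = next(reader)     # consumes only the first line
remaining = list(reader)  # iteration resumes right after the header

print(header)
print(len(remaining))
```

Because the header has already been consumed, a later `for row in reader` loop sees only the data rows.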

In [3]:
feature_list = []
label_list = []

In [4]:
for row in reader:
    label_list.append(row[-1])  # the last column is the class label
    row_dict = {}
    # Column 0 is the row id (RID); skip it because it plays no part in classification
    for i in range(1, len(row) - 1):
        row_dict[headers[i]] = row[i]
    feature_list.append(row_dict)
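
An alternative sketch of the same loop uses `csv.DictReader`, which pairs each value with its column header automatically. The in-memory `io.StringIO` sample is an assumption so the snippet is self-contained; with the real file you would pass the opened file object instead.

```python
import csv
import io

sample = io.StringIO(
    "RID,age,income,student,credit_rating,class_buys_computer\n"
    "1,youth,high,no,fair,no\n"
    "3,middle_aged,high,no,fair,yes\n"
)

feature_list = []
label_list = []
for row in csv.DictReader(sample):
    row = dict(row)                                    # plain dict on every Python 3 version
    row.pop("RID")                                     # the row id carries no information
    label_list.append(row.pop("class_buys_computer"))  # the last column is the label
    feature_list.append(row)                           # what remains are the four features

print(feature_list)
print(label_list)
```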

In [5]:
# Look at the first 5 items of feature_list
feature_list[:5]

[{'age': 'youth', 'credit_rating': 'fair', 'income': 'high', 'student': 'no'},
 {'age': 'youth',
  'credit_rating': 'excellent',
  'income': 'high',
  'student': 'no'},
 {'age': 'middle_aged',
  'credit_rating': 'fair',
  'income': 'high',
  'student': 'no'},
 {'age': 'senior',
  'credit_rating': 'fair',
  'income': 'medium',
  'student': 'no'},
 {'age': 'senior', 'credit_rating': 'fair', 'income': 'low', 'student': 'yes'}]

In [6]:
# Look at the first 5 items of label_list
label_list[:5]

['no', 'no', 'yes', 'yes', 'yes']

In [7]:
# DictVectorizer comes from sklearn.feature_extraction
# It converts the list of dicts into a one-hot encoded array
vec = DictVectorizer()
dummy_X = vec.fit_transform(feature_list).toarray()
dummy_X

array([[ 0.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  1.],
       [ 0.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  0.]])

In [8]:
# On scikit-learn 1.0 and later, use vec.get_feature_names_out() instead
vec.get_feature_names()

['age=middle_aged',
 'age=senior',
 'age=youth',
 'credit_rating=excellent',
 'credit_rating=fair',
 'income=high',
 'income=low',
 'income=medium',
 'student=no',
 'student=yes']
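
The feature names tell us how to read each column of `dummy_X`. As a quick check, the sketch below decodes the first encoded row back into its active features; both lists are copied from the outputs above, and the variable names are only illustrative.

```python
feature_names = [
    "age=middle_aged", "age=senior", "age=youth",
    "credit_rating=excellent", "credit_rating=fair",
    "income=high", "income=low", "income=medium",
    "student=no", "student=yes",
]
first_row = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]

# Keep the names of the columns that are switched on in this row.
active = [name for name, value in zip(feature_names, first_row) if value == 1.0]
print(active)
```

The result matches record 1 of the CSV file: a youth with high income, not a student, with a fair credit rating.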

In [11]:
label_list

['no',
 'no',
 'yes',
 'yes',
 'yes',
 'no',
 'yes',
 'no',
 'yes',
 'yes',
 'yes',
 'yes',
 'yes',
 'no']

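Encode the single label column with `LabelBinarizer`. Because there are only two classes, the result is one 0/1 column (`no` becomes 0, `yes` becomes 1).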

In [12]:
lb = preprocessing.LabelBinarizer()
dummy_y = lb.fit_transform(label_list)
dummy_y

array([[0],
       [0],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0]])
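
For a two-class problem this encoding is just a mapping from the sorted class names to 0 and 1. The sketch below is a hand-written equivalent for illustration, not the `LabelBinarizer` source; the names `classes`, `to_index`, `encoded`, and `decoded` are hypothetical.

```python
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

classes = sorted(set(labels))                       # alphabetical: 'no' first, then 'yes'
to_index = {c: i for i, c in enumerate(classes)}    # 'no' -> 0, 'yes' -> 1

encoded = [[to_index[label]] for label in labels]   # a single column, like dummy_y
decoded = [classes[row[0]] for row in encoded]      # map 0/1 back to the class names

print(encoded[:5])
print(decoded == labels)
```

Keeping the class list around is what lets us translate a predicted 0 or 1 back into `no` or `yes`.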

In [13]:
len(dummy_X)

14

Train a model with the decision tree algorithm

In [14]:
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(dummy_X, dummy_y)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
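
With `criterion='entropy'`, splits are scored by how much they reduce entropy. The following sketch computes the entropy of the 14 labels and the information gain of splitting on "is the customer middle_aged?" by hand. The `(age, label)` pairs are copied from the CSV file; the helper `entropy` and the variable names are illustrative, not part of scikit-learn.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


# (age, label) pairs copied from the 14 rows of AllElectronics.csv.
rows = [
    ("youth", "no"), ("youth", "no"), ("middle_aged", "yes"), ("senior", "yes"),
    ("senior", "yes"), ("senior", "no"), ("middle_aged", "yes"), ("youth", "no"),
    ("youth", "yes"), ("senior", "yes"), ("youth", "yes"), ("middle_aged", "yes"),
    ("middle_aged", "yes"), ("senior", "no"),
]
labels = [label for _, label in rows]
is_middle = [label for age, label in rows if age == "middle_aged"]
not_middle = [label for age, label in rows if age != "middle_aged"]

root = entropy(labels)
# Entropy after the split, weighted by the size of each branch.
weighted = (len(is_middle) * entropy(is_middle)
            + len(not_middle) * entropy(not_middle)) / len(labels)
gain = root - weighted

print(round(root, 3), round(gain, 3))
```

The root entropy comes out at about 0.94, which agrees with the value printed for the root node in the exported tree later in this notebook.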

Visualize the decision tree model

In [15]:
with open('allElectronicInformationGainOri.dot', 'w') as fw:
    # export_graphviz writes to out_file and returns None, so no assignment is needed
    tree.export_graphviz(clf, feature_names=vec.get_feature_names(), out_file=fw)

In [16]:
ls

01-What-is-Decision-Tree.ipynb
AllElectronics.csv
CART 与决策树中的超参数.ipynb
Untitled.ipynb
allElectronicInformationGainOri.dot
信息熵.ipynb
基尼系数.ipynb
什么是决策树算法？.ipynb
使用信息熵寻找最优划分.ipynb


The file `allElectronicInformationGainOri.dot` has now been created in the current folder. Let's take a look at it.

In [17]:
cat allElectronicInformationGainOri.dot

digraph Tree {
node [shape=box] ;
0 [label="age=middle_aged <= 0.5\nentropy = 0.94\nsamples = 14\nvalue = [5, 9]"] ;
1 [label="student=yes <= 0.5\nentropy = 1.0\nsamples = 10\nvalue = [5, 5]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="age=senior <= 0.5\nentropy = 0.722\nsamples = 5\nvalue = [4, 1]"] ;
1 -> 2 ;
3 [label="entropy = 0.0\nsamples = 3\nvalue = [3, 0]"] ;
2 -> 3 ;
4 [label="credit_rating=fair <= 0.5\nentropy = 1.0\nsamples = 2\nvalue = [1, 1]"] ;
2 -> 4 ;
5 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
4 -> 5 ;
6 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
4 -> 6 ;
7 [label="credit_rating=excellent <= 0.5\nentropy = 0.722\nsamples = 5\nvalue = [1, 4]"] ;
1 -> 7 ;
8 [label="entropy = 0.0\nsamples = 3\nvalue = [0, 3]"] ;
7 -> 8 ;
9 [label="age=youth <= 0.5\nentropy = 1.0\nsamples = 2\nvalue = [1, 1]"] ;
7 -> 9 ;
10 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
9 -> 10 ;
11 [label="entropy = 0.0

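Use a tool to render the decision tree model.

Official Graphviz website: http://www.graphviz.org/

+ Command to convert a dot file to PDF so the tree can be viewed (`iris.dot` is a placeholder file name):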

```
dot -Tpdf iris.dot -o output.pdf
```
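
If you prefer to stay in Python rather than use a shell command, the same conversion can be sketched with the standard library. This assumes the Graphviz `dot` executable is on your `PATH`; the helper name `dot_command` is hypothetical.

```python
import os
import shutil
import subprocess


def dot_command(dot_path, pdf_path):
    """Build the argument list for converting a .dot file to PDF."""
    return ["dot", "-Tpdf", dot_path, "-o", pdf_path]


cmd = dot_command("allElectronicInformationGainOri.dot", "output.pdf")
print(" ".join(cmd))

# Only attempt the conversion when Graphviz is installed and the input file exists.
if shutil.which("dot") and os.path.exists(cmd[2]):
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a file name contains spaces.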

In [18]:
!dot -Tpdf allElectronicInformationGainOri.dot -o output.pdf

In [19]:
ls

01-What-is-Decision-Tree.ipynb
AllElectronics.csv
CART 与决策树中的超参数.ipynb
Untitled.ipynb
allElectronicInformationGainOri.dot
output.pdf
信息熵.ipynb
基尼系数.ipynb
什么是决策树算法？.ipynb
使用信息熵寻找最优划分.ipynb


The file output.pdf now appears in the current folder.

Now let's make some predictions.

In [20]:
one_row_X = dummy_X[0,:]
one_row_X

array([ 0.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  1.,  0.])

In [26]:
one_row_X.reshape(1,-1)

array([[ 0.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  1.,  0.]])

In [27]:
clf.predict(one_row_X.reshape(1,-1))

array([0])

Let's modify the data slightly and predict again.

In [31]:
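# Copy first: one_row_X is a view into dummy_X, so editing it in place would also change the training data
new_row_X = one_row_X.copy()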
new_row_X[0] = 1
new_row_X[2] = 0
print("newRowX: " + str(new_row_X))

predicted_y = clf.predict(new_row_X.reshape(1,-1))
print("predictedY: " + str(predicted_y))

newRowX: [ 1.  0.  0.  0.  1.  1.  0.  0.  1.  0.]
predictedY: [1]
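
To see why the two predictions differ, we can follow the tests of the exported tree by hand. The sketch below hard-codes the splits shown in the dot file. Note the assumptions: the printed dot output is cut off before the last leaves, so the `middle_aged` leaf and the final `youth` leaf are inferred from the CSV rows rather than read from the file, and `predict_by_hand` is a hypothetical helper, not a scikit-learn function.

```python
FEATURES = [
    "age=middle_aged", "age=senior", "age=youth",
    "credit_rating=excellent", "credit_rating=fair",
    "income=high", "income=low", "income=medium",
    "student=no", "student=yes",
]


def predict_by_hand(vector):
    """Follow the exported tree's tests; returns 0 ('no') or 1 ('yes')."""
    x = dict(zip(FEATURES, vector))
    if x["age=middle_aged"] > 0.5:
        return 1  # inferred from the CSV: all four middle_aged rows are 'yes'
    if x["student=yes"] <= 0.5:  # not a student
        if x["age=senior"] <= 0.5:
            return 0  # youth, not a student
        # senior, not a student: decided by the credit rating
        return 0 if x["credit_rating=fair"] <= 0.5 else 1
    if x["credit_rating=excellent"] <= 0.5:
        return 1  # student with a fair credit rating
    # student with an excellent rating: the youth leaf is inferred from CSV row 11
    return 0 if x["age=youth"] <= 0.5 else 1


original = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
modified = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
print(predict_by_hand(original), predict_by_hand(modified))
```

Switching the age from `youth` to `middle_aged` sends the row down the other branch at the root, which is why the prediction flips from 0 to 1.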
