# Preface

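Last time I tried training a plain linear regression on a dataset with 80 features, and the gradients exploded. Today I am trying again with about the simplest dataset I could find, to see whether training converges this time. What new problems will turn up along the way?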

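## Dataset
The data comes from the week 2 exercise of Andrew Ng's machine learning course. It was originally a txt file; I saved it as a CSV with three lines of code, and you can [download it here](https://github.com/linguoguo/data_science/blob/master/house_pricing/data/house_2_features.csv).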

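### Reading the dataset

The data is not split into a training set and a test set, and each house has only two features: area and number of bedrooms.
We will read and process the data with the `pandas` library.

Import the packages needed here.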

In [1]:
%matplotlib inline
import d2lzh as d2l
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import data as gdata, loss as gloss, nn
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('data/house/house_2_features.csv', index_col=0)

In [3]:
data.head()

| | size | bedroom | price |
|---|---|---|---|
| 0 | 1600 | 3 | 329900 |
| 1 | 2400 | 3 | 369000 |
| 2 | 1416 | 2 | 232000 |
| 3 | 3000 | 4 | 539900 |
| 4 | 1985 | 4 | 299900 |


In [4]:
data.shape

(46, 3)

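### Preprocessing the dataset

We apply standardization to the continuous-valued features: let a feature's mean over the whole dataset be $\mu$ and its standard deviation be $\sigma$. Each value of the feature is standardized by subtracting $\mu$ and then dividing by $\sigma$. Missing feature values are replaced with the feature's mean. Note that the code below applies this to every column, so the `price` label is standardized as well.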

In [5]:
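# Standardize every column: subtract the column mean, divide by the column standard deviation.
data = data.apply(
    lambda x: (x - x.mean()) / (x.std()))

# fillna returns a new DataFrame, so the result has to be assigned back.
data = data.fillna(0)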

After standardization every feature has mean 0, so missing values can simply be replaced with 0.
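The standardization step can also be written so that $\mu$ and $\sigma$ are kept around, which makes it possible to turn a standardized prediction back into a price later. The following is a minimal NumPy sketch, not the notebook's own code: it uses only the five rows shown by `data.head()`, and the names `raw`, `mu`, `sigma`, `standardized`, and `restored` are introduced here for illustration.

```python
import numpy as np

# The five rows shown by data.head(): size, bedroom, price.
raw = np.array([
    [1600, 3, 329900],
    [2400, 3, 369000],
    [1416, 2, 232000],
    [3000, 4, 539900],
    [1985, 4, 299900],
], dtype=np.float64)

# Keep the statistics instead of discarding them.
mu = raw.mean(axis=0)
sigma = raw.std(axis=0, ddof=1)   # ddof=1 matches the default of pandas' std()

standardized = (raw - mu) / sigma

# Inverting the transform recovers the original units.
restored = standardized * sigma + mu

print(standardized.round(3))
```

Because only five rows are used, the numbers differ from the standardized table computed on all 46 rows; the point is the round trip, not the values.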

In [6]:
data.head()

| | size | bedroom | price |
|---|---|---|---|
| 0 | -0.495977 | -0.226166 | -0.07311 |
| 1 | 0.499874 | -0.226166 | 0.236953 |
| 2 | -0.725023 | -1.526618 | -0.849457 |
| 3 | 1.246762 | 1.074287 | 1.59219 |
| 4 | -0.016724 | 1.074287 | -0.31101 |


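Split the dataset into a training set (the first 36 rows) and a test set (the remaining 10 rows), take the NumPy data through the `values` attribute, and convert it to `NDArray` so it is ready for training.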

In [7]:
n_train=36
train_features = nd.array(data[['size','bedroom']][:n_train].values)
test_features = nd.array(data[['size','bedroom']][n_train:].values)
train_labels = nd.array(data.price[:n_train].values).reshape((-1, 1))
test_labels = nd.array(data.price[n_train:].values).reshape((-1, 1))

In [8]:
train_features.shape

(36, 2)

In [9]:
train_features[:3]


[[-0.4959771  -0.22616564]
 [ 0.4998739  -0.22616564]
 [-0.72502285 -1.526618  ]]
<NDArray 3x2 @cpu(0)>

### Defining the model

We train a basic linear regression model with a squared loss. For more on the steps involved in using Gluon, see here.

In [10]:
net = nn.Sequential()
net.add(nn.Dense(1))

### Initializing the model parameters

In [11]:
net.initialize(init.Normal(sigma=0.01))

### Defining the loss function

In [12]:
loss = gloss.L2Loss()

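### Defining the optimization algorithm

Create a `Trainer` instance and specify mini-batch stochastic gradient descent (`sgd`) with a learning rate of 0.03 as the optimization algorithm. It will update all the parameters contained in the layers that were added to the `net` instance through `add`. These parameters can be obtained with the `collect_params` function.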

In [13]:
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.03})
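
As a rough picture of what mini-batch `sgd` does with a squared loss, here is a minimal NumPy sketch. It is not Gluon's implementation: the data is synthetic (not the housing data), the true weights `0.8` and `-0.1` are made up, and `half_mse` is a helper introduced for illustration. The learning rate and batch size mirror the values used in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the housing data): y = 0.8*x1 - 0.1*x2 + noise.
X = rng.normal(size=(36, 2))
true_w = np.array([[0.8], [-0.1]])
y = X @ true_w + 0.05 * rng.normal(size=(36, 1))

w = np.zeros((2, 1))
b = 0.0
lr, batch_size = 0.03, 2

def half_mse(X, y, w, b):
    """Mean of 0.5 * (prediction - label)^2 over all rows."""
    return float(np.mean(0.5 * (X @ w + b - y) ** 2))

initial_loss = half_mse(X, y, w, b)
for epoch in range(40):
    order = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        err = X[idx] @ w + b - y[idx]            # shape (batch, 1)
        grad_w = X[idx].T @ err / len(idx)       # gradient averaged over the batch
        grad_b = float(err.mean())
        w -= lr * grad_w                         # step against the gradient
        b -= lr * grad_b
final_loss = half_mse(X, y, w, b)

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Dividing the summed gradient by the batch size plays the same role as the argument passed to `trainer.step(batch_size)` in the training loop.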

### Training the model
Randomly read mini-batches of `batch_size` examples.

In [14]:
batch_size=2
train_iter = gdata.DataLoader(gdata.ArrayDataset(train_features, train_labels), batch_size, shuffle=True)

In [15]:
num_epochs = 40
for epoch in range(1, num_epochs + 1):
    for X, y in train_iter:
        with autograd.record():
            l = loss(net(X), y)
        l.backward()
        trainer.step(batch_size)
    l = loss(net(train_features), train_labels)
    print('epoch %d, loss: %f' % (epoch, l.mean().asnumpy()))

epoch 1, loss: 0.246766
epoch 2, loss: 0.175986
epoch 3, loss: 0.154656
epoch 4, loss: 0.145171
epoch 5, loss: 0.139558
epoch 6, loss: 0.135769
epoch 7, loss: 0.133386
epoch 8, loss: 0.131887
epoch 9, loss: 0.131060
epoch 10, loss: 0.130636
epoch 11, loss: 0.130462
epoch 12, loss: 0.130103
epoch 13, loss: 0.130018
epoch 14, loss: 0.129933
epoch 15, loss: 0.129811
epoch 16, loss: 0.129775
epoch 17, loss: 0.129756
epoch 18, loss: 0.129881
epoch 19, loss: 0.129783
epoch 20, loss: 0.129748
epoch 21, loss: 0.129793
epoch 22, loss: 0.129718
epoch 23, loss: 0.129762
epoch 24, loss: 0.129740
epoch 25, loss: 0.129720
epoch 26, loss: 0.129723
epoch 27, loss: 0.129789
epoch 28, loss: 0.129730
epoch 29, loss: 0.129753
epoch 30, loss: 0.129713
epoch 31, loss: 0.129727
epoch 32, loss: 0.129721
epoch 33, loss: 0.129731
epoch 34, loss: 0.129759
epoch 35, loss: 0.129845
epoch 36, loss: 0.129771
epoch 37, loss: 0.129713
epoch 38, loss: 0.129727
epoch 39, loss: 0.129765
epoch 40, loss: 0.129871


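## Afterword
For now the training does converge, and the loss is far lower than last time. Next time we will look at a few questions:
+ How to compute house prices for the test set
+ Whether the model is overfitting
+ How to read the loss value: is it large or small?

Fellow beginners, what do you think?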

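Picking up where we left off: we trained a linear regression model on a house price dataset with two features and 46 examples.
# Prediction results
How do we compute prices for the test set? My brain short-circuited yesterday. Copying code feels great, and copying more feels even better, so great that I never read the rest of the code! In the middle of the night it suddenly hit me: how did I compute the loss back then?
I happily went to check the result, and it looks a tiny bit on the large side.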

In [16]:
y_predit=net(test_features)
l = loss(y_predit, test_labels)
print(l.mean().asnumpy())

[0.1614004]


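# How to read the loss
If I cannot tell what range a loss value should fall in, what is the point of knowing so many kinds of loss functions? Hmph.
I could not find anything useful online, so let's work it out ourselves, starting with the range of values in the dataset.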

In [17]:
data.describe()

| | size | bedroom | price |
|---|---|---|---|
| count | 46.0 | 46.0 | 46.0 |
| mean | -9.171408e-17 | 1.339508e-16 | -4.344351e-17 |
| std | 1.0 | 1.0 | 1.0 |
| min | -1.427098 | -2.82707 | -1.34191 |
| 25% | -0.7082178 | -0.2261656 | -0.7075102 |
| 50% | -0.1598774 | -0.2261656 | -0.3110103 |
| 75% | 0.3560979 | 1.074287 | 0.2359614 |
| max | 3.086597 | 2.374739 | 2.860989 |


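A loss of 0.16 does not look very large next to a variance of 1... Up to this point every assessment has been "seems like", "more or less", "roughly"... As a future expert, how can I set the bar this low for myself?

Let's print the results and take a look.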

In [18]:
for i in range(10):
    print(test_labels[i],y_predit[i])


[0.0466327]
<NDArray 1 @cpu(0)> 
[0.20947975]
<NDArray 1 @cpu(0)>

[1.6643525]
<NDArray 1 @cpu(0)> 
[2.6858356]
<NDArray 1 @cpu(0)>

[-0.41330725]
<NDArray 1 @cpu(0)> 
[0.24514496]
<NDArray 1 @cpu(0)>

[0.23298769]
<NDArray 1 @cpu(0)> 
[-0.332606]
<NDArray 1 @cpu(0)>

[-0.07311028]
<NDArray 1 @cpu(0)> 
[0.34264287]
<NDArray 1 @cpu(0)>

[-0.19919726]
<NDArray 1 @cpu(0)> 
[0.7266256]
<NDArray 1 @cpu(0)>

[-0.31814724]
<NDArray 1 @cpu(0)> 
[-0.8913743]
<NDArray 1 @cpu(0)>

[-1.2626102]
<NDArray 1 @cpu(0)> 
[-1.297945]
<NDArray 1 @cpu(0)>

[-0.31101027]
<NDArray 1 @cpu(0)> 
[-0.12339576]
<NDArray 1 @cpu(0)>

[-0.7899822]
<NDArray 1 @cpu(0)> 
[-0.8878077]
<NDArray 1 @cpu(0)>
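
One way to put a number on "large or small" is to recompute the loss by hand and compare it with a trivial baseline. The sketch below copies the ten test labels and predictions from the printed output above. It relies on Gluon's `L2Loss` being half the squared error, so the mean it reports is half the usual MSE. The "always predict 0" baseline is my own choice of reference point, and the names `half_mse`, `rmse`, and `baseline` are introduced here.

```python
import numpy as np

# Test labels and predictions copied from the printed output above (standardized units).
labels = np.array([0.0466327, 1.6643525, -0.41330725, 0.23298769, -0.07311028,
                   -0.19919726, -0.31814724, -1.2626102, -0.31101027, -0.7899822])
preds = np.array([0.20947975, 2.6858356, 0.24514496, -0.332606, 0.34264287,
                  0.7266256, -0.8913743, -1.297945, -0.12339576, -0.8878077])

half_mse = np.mean(0.5 * (preds - labels) ** 2)  # what L2Loss(...).mean() reports
mse = 2 * half_mse
rmse = np.sqrt(mse)                              # typical error, in standard deviations of price
baseline = np.mean(0.5 * labels ** 2)            # always predicting 0, the standardized mean price

print(f"half-MSE {half_mse:.4f}  RMSE {rmse:.3f}  predict-zero baseline {baseline:.4f}")
```

On these ten rows the hand computation reproduces the reported 0.1614. The RMSE comes out near 0.57 standard deviations of price, and the model's loss is roughly 40% below the predict-zero baseline of about 0.27, which suggests the model has learned something but is still far from precise.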


The first three pairs are way off. Let's redo the comparison as percentages.

In [19]:
for i in range(10):
    print( ((test_labels[i]-y_predit[i])*100/test_labels[i]).asnumpy(),'%')

[-349.2121] %
[-61.374203] %
[159.31302] %
[242.75688] %
[568.6658] %
[464.7769] %
[-180.17665] %
[-2.7985537] %
[60.32421] %
[-12.383257] %


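If I used these numbers to decide whether a house is worth buying, I would lose my shirt. The worst case is off by more than 5x. I give up; this result is clearly not usable! Let's look at the average price per unit of area.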

In [20]:
data['price_size']=data['price']/data['size']
data['price_bedroom']=data['price']/data['bedroom']
data.describe()

| | size | bedroom | price | price_size | price_bedroom |
|---|---|---|---|---|---|
| count | 46.0 | 46.0 | 46.0 | 46.0 | 46.0 |
| mean | -9.171408e-17 | 1.339508e-16 | -4.344351e-17 | 2.387465 | 0.450725 |
| std | 1.0 | 1.0 | 1.0 | 5.833545 | 2.856655 |
| min | -1.427098 | -2.82707 | -1.34191 | -3.711958 | -8.442439 |
| 25% | -0.7082178 | -0.2261656 | -0.7075102 | 0.341222 | -0.286206 |
| 50% | -0.1598774 | -0.2261656 | -0.3110103 | 0.969627 | 0.455811 |
| 75% | 0.3560979 | 1.074287 | 0.2359614 | 1.594585 | 1.826074 |
| max | 3.086597 | 2.374739 | 2.860989 | 32.073789 | 4.913015 |


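For price per unit of area, 75% of the values are below 1.59, yet the maximum reaches 32, which is absurdly large. Maybe the dataset mixes houses from different places or of different types, or maybe there are data-entry errors? What do we do now? Week 6 of Andrew Ng's course seems to have the answer, so stay tuned for the next installment.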
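
Another possible explanation for the value of 32 is worth checking: `price_size` is computed after standardization, so it is a ratio of two z-scores, not a price per unit of area, and it explodes whenever the standardized size is close to 0. The sketch below compares both ratios on the five rows shown in the two `data.head()` outputs; the variable names are introduced here for illustration.

```python
import numpy as np

# Rows taken from the two data.head() outputs (before and after standardization).
raw_size = np.array([1600, 2400, 1416, 3000, 1985], dtype=np.float64)
raw_price = np.array([329900, 369000, 232000, 539900, 299900], dtype=np.float64)
std_size = np.array([-0.495977, 0.499874, -0.725023, 1.246762, -0.016724])
std_price = np.array([-0.07311, 0.236953, -0.849457, 1.59219, -0.31101])

price_per_area = raw_price / raw_size      # an actual price per unit of area
ratio_of_z_scores = std_price / std_size   # what price / size yields after standardization

print(price_per_area.round(1))
print(ratio_of_z_scores.round(2))
```

On these rows the price per unit of area stays within a fairly narrow band, while the ratio of z-scores jumps by two orders of magnitude for the house whose size is almost exactly average. Note also that both new columns are derived from `price`, the value being predicted, so they would not be available for a house whose price is unknown.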

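Let's train with the two new features added. On the first attempt the loss went to NaN: training really is sensitive to the learning rate, and one small misstep makes the gradients explode. See this blog post for reference.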

In [29]:
n_train=36
train_features = nd.array(data[['size','bedroom','price_size','price_bedroom']][:n_train].values)
test_features = nd.array(data[['size','bedroom','price_size','price_bedroom']][n_train:].values)
train_labels = nd.array(data.price[:n_train].values).reshape((-1, 1))
test_labels = nd.array(data.price[n_train:].values).reshape((-1, 1))

net = nn.Sequential()
net.add(nn.Dense(1))

net.initialize(init.Normal(sigma=0.01))

loss = gloss.L2Loss()

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.03})

batch_size=2
train_iter = gdata.DataLoader(gdata.ArrayDataset(train_features, train_labels), batch_size, shuffle=True)
num_epochs = 40
for epoch in range(1, num_epochs + 1):
    for X, y in train_iter:
        with autograd.record():
            l = loss(net(X), y)
        l.backward()
        trainer.step(batch_size)
    l = loss(net(train_features), train_labels)
    print('epoch %d, loss: %f' % (epoch, l.mean().asnumpy()))

epoch 1, loss: 2190.526367
epoch 2, loss: 710725.562500
epoch 3, loss: 284985280.000000
epoch 4, loss: 227231154176.000000
epoch 5, loss: 85445782274048.000000
epoch 6, loss: 87628233465397248.000000
epoch 7, loss: 37323010165786542080.000000
epoch 8, loss: 4764028720072496250880.000000
epoch 9, loss: 6082304267000688419012608.000000
epoch 10, loss: 6613091637294114079303008256.000000
epoch 11, loss: 2311386075767621512855384227840.000000
epoch 12, loss: 909418451767419360324547522854912.000000
epoch 13, loss: 287194146490916168776753067186978816.000000
epoch 14, loss: inf
epoch 15, loss: inf
epoch 16, loss: inf
epoch 17, loss: inf
epoch 18, loss: inf
epoch 19, loss: inf
epoch 20, loss: inf
epoch 21, loss: inf
epoch 22, loss: inf
epoch 23, loss: inf
epoch 24, loss: inf
epoch 25, loss: inf
epoch 26, loss: inf
epoch 27, loss: nan
epoch 28, loss: nan
epoch 29, loss: nan
epoch 30, loss: nan
epoch 31, loss: nan
epoch 32, loss: nan
epoch 33, loss: nan
epoch 34, loss: nan
epoch 35, loss: nan
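
A learning rate that worked for two standardized features can diverge once columns with a much larger spread are added. For plain gradient descent on a squared loss, the update is stable only while the learning rate stays below 2 divided by the curvature, and the curvature grows with the square of the feature scale. The sketch below shows this with full-batch gradient descent on one synthetic feature. It is a simplified stand-in, not the mini-batch run above, and `run_gd` is a helper introduced for illustration.

```python
import numpy as np

def run_gd(scale, lr, steps=30):
    """Full-batch gradient descent on one synthetic feature with the given scale."""
    rng = np.random.default_rng(0)
    x = scale * rng.normal(size=(36, 1))   # synthetic feature, not the housing data
    y = 0.5 * x                            # noiseless target with true weight 0.5
    w = np.zeros((1, 1))
    for _ in range(steps):
        grad = x.T @ (x @ w - y) / len(x)  # gradient of the mean half squared error
        w -= lr * grad
    loss = float(np.mean(0.5 * (x @ w - y) ** 2))
    max_stable_lr = 2.0 / float(np.mean(x ** 2))   # stability limit for this feature
    return loss, max_stable_lr

for scale in (1.0, 20.0):
    loss, limit = run_gd(scale, lr=0.03)
    print(f"scale={scale:5.1f}  loss after 30 steps={loss:.3e}  stable only if lr < {limit:.4f}")
```

With the same learning rate of 0.03, the unit-scale feature converges while the feature with scale 20 diverges, which is consistent with needing a smaller learning rate after adding `price_size` and `price_bedroom`.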


In [30]:
n_train=36
train_features = nd.array(data[['size','bedroom','price_size','price_bedroom']][:n_train].values)
test_features = nd.array(data[['size','bedroom','price_size','price_bedroom']][n_train:].values)
train_labels = nd.array(data.price[:n_train].values).reshape((-1, 1))
test_labels = nd.array(data.price[n_train:].values).reshape((-1, 1))

net = nn.Sequential()
net.add(nn.Dense(1))

net.initialize(init.Normal(sigma=0.01))

loss = gloss.L2Loss()

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})

batch_size=2
train_iter = gdata.DataLoader(gdata.ArrayDataset(train_features, train_labels), batch_size, shuffle=True)
num_epochs = 40
for epoch in range(1, num_epochs + 1):
    for X, y in train_iter:
        with autograd.record():
            l = loss(net(X), y)
        l.backward()
        trainer.step(batch_size)
    l = loss(net(train_features), train_labels)
    print('epoch %d, loss: %f' % (epoch, l.mean().asnumpy()))

epoch 1, loss: 0.280175
epoch 2, loss: 0.252596
epoch 3, loss: 0.247836
epoch 4, loss: 3.694590
epoch 5, loss: 0.971740
epoch 6, loss: 0.563291
epoch 7, loss: 2.774067
epoch 8, loss: 0.801260
epoch 9, loss: 0.179848
epoch 10, loss: 0.149339
epoch 11, loss: 1.921999
epoch 12, loss: 0.128801
epoch 13, loss: 0.215510
epoch 14, loss: 0.193499
epoch 15, loss: 0.514674
epoch 16, loss: 0.250536
epoch 17, loss: 0.112822
epoch 18, loss: 0.170746
epoch 19, loss: 0.232429
epoch 20, loss: 0.146247
epoch 21, loss: 1.763635
epoch 22, loss: 1.205917
epoch 23, loss: 0.113654
epoch 24, loss: 0.110538
epoch 25, loss: 0.166103
epoch 26, loss: 0.147857
epoch 27, loss: 0.178377
epoch 28, loss: 0.152319
epoch 29, loss: 0.131106
epoch 30, loss: 0.399766
epoch 31, loss: 0.109490
epoch 32, loss: 0.546786
epoch 33, loss: 0.150535
epoch 34, loss: 0.521338
epoch 35, loss: 2.954722
epoch 36, loss: 0.106273
epoch 37, loss: 0.139209
epoch 38, loss: 0.422830
epoch 39, loss: 0.115425
epoch 40, loss: 0.105446


Here the loss occasionally jumps up, which suggests we are already hovering around the optimum. Good enough; let's test:

In [31]:
y_predit=net(test_features)
l = loss(y_predit, test_labels)
print(l.mean().asnumpy())

[0.1441034]


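# Overfitting or underfitting?
The test loss of 0.144 is not far from the training loss of 0.105, but the result is still not good. My intuition says this is underfitting, because it is the bias that is large here. I found a well-written article: [Neural networks: underfitting and overfitting](https://www.jianshu.com/p/9b6b0d6d3bd0), and another one on overtraining: [Overfitting explained: the inaccurate "common sense" of supervised learning](https://www.jiqizhixin.com/articles/2019-01-25-23).
In the end I cannot rely on intuition alone to decide between underfitting and overfitting; a more rigorous method is needed: [Learning curves: telling underfitting from overfitting](https://blog.csdn.net/geduo_feng/article/details/79547554)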
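
The learning-curve idea can be sketched in a few lines: fit on the first `m` training examples for growing `m`, and record the error on those `m` examples and on a held-out set. The sketch below is a stand-in under stated assumptions, not the MXNet version: it uses synthetic data with the same 36/10 split, and it swaps the SGD loop for a closed-form least-squares fit (`np.linalg.lstsq`) to keep the example short. The helpers `add_bias` and `half_mse` are introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with the same shape as this post's split (36 train, 10 validation).
X = rng.normal(size=(46, 2))
y = X @ np.array([0.8, -0.1]) + 0.3 * rng.normal(size=46)
X_train, y_train = X[:36], y[:36]
X_val, y_val = X[36:], y[36:]

def add_bias(X):
    """Append a column of ones so the last coefficient acts as the bias."""
    return np.hstack([X, np.ones((len(X), 1))])

def half_mse(X, y, theta):
    return float(np.mean(0.5 * (add_bias(X) @ theta - y) ** 2))

sizes = list(range(4, 37, 4))
curve = []
for m in sizes:
    # Least-squares fit on the first m training examples only.
    theta, *_ = np.linalg.lstsq(add_bias(X_train[:m]), y_train[:m], rcond=None)
    curve.append((m, half_mse(X_train[:m], y_train[:m], theta),
                  half_mse(X_val, y_val, theta)))

for m, train_err, val_err in curve:
    print(f"m={m:2d}  train={train_err:.4f}  val={val_err:.4f}")
```

Read as a rule of thumb: if both errors flatten out at a similar, high value, that points to underfitting; a persistent gap between a low training error and a higher validation error points to overfitting.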
### Coming up
+ Tomorrow we will look at how to write learning curves with mxnet.
+ A look at other, better algorithms
+ More ways to extract features