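Regression models the relationship between input variables (the independent variables) and an output variable (the dependent variable), and uses that relationship to predict the output.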

The example below uses the classic Auto MPG (Miles Per Gallon) dataset to show how to build a deep network model in PyTorch that predicts a car's fuel efficiency.

# Preprocessing

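The Auto MPG dataset has 398 records with 9 columns each: 8 numeric attributes (fuel efficiency, number of cylinders, displacement, horsepower, weight, acceleration, model year, and origin) plus the car model name. The loading code reads only the 8 numeric columns; `comment='\t'` makes pandas ignore everything after the tab that precedes the car name in the raw file.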

In [31]:
import pandas as pd
dataset_path = r'D:\JJCCHuang\Study\git_code\Practice_of_AI_Intro_XB\CH5 Pytorch Basic\auto-mpg.data'
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']
raw_dataset = pd.read_csv(dataset_path, names=column_names, na_values = "?",comment='\t', sep=" ", skipinitialspace=True)
dataset = raw_dataset.copy()
dataset  # display the dataset

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin
0,18.0,8,307.0,130.0,3504.0,12.0,70,1
1,15.0,8,350.0,165.0,3693.0,11.5,70,1
2,18.0,8,318.0,150.0,3436.0,11.0,70,1
3,16.0,8,304.0,150.0,3433.0,12.0,70,1
4,17.0,8,302.0,140.0,3449.0,10.5,70,1
...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790.0,15.6,82,1
394,44.0,4,97.0,52.0,2130.0,24.6,82,2
395,32.0,4,135.0,84.0,2295.0,11.6,82,1
396,28.0,4,120.0,79.0,2625.0,18.6,82,1


In [32]:
dataset = dataset.dropna()  # drop rows that contain missing values
dataset  # display the dataset again

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin
0,18.0,8,307.0,130.0,3504.0,12.0,70,1
1,15.0,8,350.0,165.0,3693.0,11.5,70,1
2,18.0,8,318.0,150.0,3436.0,11.0,70,1
3,16.0,8,304.0,150.0,3433.0,12.0,70,1
4,17.0,8,302.0,140.0,3449.0,10.5,70,1
...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790.0,15.6,82,1
394,44.0,4,97.0,52.0,2130.0,24.6,82,2
395,32.0,4,135.0,84.0,2295.0,11.6,82,1
396,28.0,4,120.0,79.0,2625.0,18.6,82,1
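Before dropping rows, it can help to see which columns actually contain missing values and how many rows will be lost. The following is a minimal sketch on a toy frame; the values are made up for illustration and are not taken from Auto MPG.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw data; values are illustrative only.
toy = pd.DataFrame({
    "MPG": [18.0, 15.0, 25.0, 21.0],
    "Horsepower": [130.0, np.nan, 95.0, np.nan],
})

missing_per_column = toy.isna().sum()  # number of NaN entries in each column
cleaned = toy.dropna()                 # drop every row that contains a NaN

print(missing_per_column)
print(len(toy), "->", len(cleaned))
```

Running the same two calls on `raw_dataset` would show where the `"?"` placeholders (parsed as NaN via `na_values`) occur.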


In [33]:
# One-hot encode the country of origin
origin = dataset.Origin
dataset['USA'] = (origin == 1) * 1.0
dataset['Europe'] = (origin == 2) * 1.0
dataset['Japan'] = (origin == 3) * 1.0
dataset  # display the dataset again

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset['USA'] = (origin == 1) * 1.0
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset['Europe'] = (origin == 2) * 1.0
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset['Japan'] = (origin == 3) * 1.0


Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin,USA,Europe,Japan
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,1.0,0.0,0.0
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,1.0,0.0,0.0
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,1.0,0.0,0.0
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,1.0,0.0,0.0
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790.0,15.6,82,1,1.0,0.0,0.0
394,44.0,4,97.0,52.0,2130.0,24.6,82,2,0.0,1.0,0.0
395,32.0,4,135.0,84.0,2295.0,11.6,82,1,1.0,0.0,0.0
396,28.0,4,120.0,79.0,2625.0,18.6,82,1,1.0,0.0,0.0
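The warning printed above appears because pandas cannot tell whether `dataset` is an independent frame or a slice of another one. One common way to avoid it, sketched here on a toy frame, is to take an explicit `.copy()` after `dropna()` before adding columns. The toy values are illustrative; the code-to-country mapping is the one used in the cell above.

```python
import pandas as pd

# Toy frame; Origin codes follow the cell above (1=USA, 2=Europe, 3=Japan).
toy = pd.DataFrame({"MPG": [18.0, 44.0, 32.0], "Origin": [1, 2, 3]})

# An explicit copy makes it unambiguous that we write to our own frame.
encoded = toy.dropna().copy()
origin = encoded.pop("Origin")  # remove the integer column and keep its values

for code, name in {1: "USA", 2: "Europe", 3: "Japan"}.items():
    encoded[name] = (origin == code).astype(float)  # 1.0 where the code matches

print(encoded)
```

Popping `Origin` first also folds the later `dataset.pop('Origin')` step into the encoding.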


In [34]:
dataset.pop('Origin')
dataset

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,USA,Europe,Japan
0,18.0,8,307.0,130.0,3504.0,12.0,70,1.0,0.0,0.0
1,15.0,8,350.0,165.0,3693.0,11.5,70,1.0,0.0,0.0
2,18.0,8,318.0,150.0,3436.0,11.0,70,1.0,0.0,0.0
3,16.0,8,304.0,150.0,3433.0,12.0,70,1.0,0.0,0.0
4,17.0,8,302.0,140.0,3449.0,10.5,70,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790.0,15.6,82,1.0,0.0,0.0
394,44.0,4,97.0,52.0,2130.0,24.6,82,0.0,1.0,0.0
395,32.0,4,135.0,84.0,2295.0,11.6,82,1.0,0.0,0.0
396,28.0,4,120.0,79.0,2625.0,18.6,82,1.0,0.0,0.0


In [41]:
# Split the dataset so the model can be evaluated on held-out data
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)
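The `sample` / `drop` pair used for the split can be checked on a small frame. This sketch, on made-up data, confirms that the two parts are disjoint and together cover every row.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10)})  # 10 illustrative rows

train = toy.sample(frac=0.8, random_state=0)  # 80% of the rows, reproducible
test = toy.drop(train.index)                  # whatever was not sampled

print(len(train), len(test))
print(sorted(test.index))
```

Fixing `random_state` keeps the split identical between runs, so results stay comparable.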

In [43]:
# Compute summary statistics for each feature
train_stats = train_dataset.describe()
train_stats.pop("MPG")
train_stats = train_stats.transpose()

In [44]:
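# Separate the labels from the features in the training and test sets; the label is the value the model predicts, i.e. fuel efficiency (MPG)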
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')


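To help the model converge faster, each feature is standardized (shifted to zero mean and scaled to unit standard deviation) using statistics computed from the training set only.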

In [45]:
def norm(x):
    return (x - train_stats['mean']) / train_stats['std']
normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)
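The standardization step can be illustrated on a toy example. The sketch below uses made-up numbers and the same `describe().transpose()` pattern as the cells above. Note that the test rows are scaled with the training mean and standard deviation, and that pandas reports the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Illustrative values only, not taken from Auto MPG.
train = pd.DataFrame({"Weight": [2000.0, 3000.0, 4000.0],
                      "Cylinders": [4.0, 6.0, 8.0]})
test = pd.DataFrame({"Weight": [3500.0], "Cylinders": [8.0]})

# One row per feature, with 'mean' and 'std' among the columns.
stats = train.describe().transpose()

def norm(x):
    # Column-wise z-score using training statistics
    return (x - stats["mean"]) / stats["std"]

normed_train = norm(train)
normed_test = norm(test)

print(normed_train)
print(normed_test)
```

Reusing the training statistics for the test set avoids leaking information about the test data into preprocessing.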

# Model

In [46]:
# Build a simple neural-network regression model with two fully connected hidden layers and an output layer.

import torch
from torch import nn

class Model(nn.Module):
    def __init__(self, input_dim, middle_dim):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(input_dim, middle_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(middle_dim, middle_dim)
        self.out = nn.Linear(middle_dim, 1)

    def forward(self, x):
        # view(-1) flattens the (batch, 1) output to 1-D; -1 lets PyTorch infer the length
        return self.out(self.relu(self.fc2(self.relu(self.fc1(x))))).view(-1)
    
model = Model(len(train_dataset.keys()), 64)  # len(train_dataset.keys()) is the number of features
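To make the tensor shapes concrete, the sketch below rebuilds the same layer layout with `nn.Sequential` (a stand-in for the `Model` class, not a replacement for it) and pushes a random batch through it. The input size of 9 assumes the six numeric features plus the three one-hot origin columns.

```python
import torch
from torch import nn

# Same layer layout as Model above, written with nn.Sequential for brevity.
input_dim, middle_dim = 9, 64  # assumes 9 features after one-hot encoding Origin
net = nn.Sequential(
    nn.Linear(input_dim, middle_dim), nn.ReLU(),
    nn.Linear(middle_dim, middle_dim), nn.ReLU(),
    nn.Linear(middle_dim, 1),
)

batch = torch.randn(32, input_dim)  # random stand-in for a batch of features
out = net(batch)                    # shape (32, 1)
flat = out.view(-1)                 # shape (32,), same shape as the labels

n_params = sum(p.numel() for p in net.parameters())
print(out.shape, flat.shape, n_params)
```

Flattening matters for the loss: comparing a `(32, 1)` prediction with a `(32,)` label in `nn.MSELoss` would broadcast to `(32, 32)` and silently compute the wrong quantity.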

# Train

In [54]:
from torch.utils.data import TensorDataset, DataLoader
from torch import optim

train_dataset = TensorDataset(torch.from_numpy(normed_train_data.values).float()
                             ,torch.from_numpy(train_labels.values).float())
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)

test_dataset = TensorDataset(torch.from_numpy(normed_test_data.values).float()
                             ,torch.from_numpy(test_labels.values).float())
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, num_workers=4)

criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

epochs=20
for epoch in range(epochs):
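    for feats, labels in train_loader:  # fetch a mini-batch of features and labels
        optimizer.zero_grad()  # reset the gradients left over from the previous step
        y = model(feats)  # forward pass: predicted values for this batch
        loss = criterion(y, labels)  # mean squared error on this batch
        loss.backward()  # backpropagate to compute the gradients
        optimizer.step()  # update the parameters with the computed gradients

    # Note: the value printed here is the loss of the last mini-batch in the epoch
    if (epoch + 1) % 4 == 0:
        print('Epoch {}/{}: loss {}'.format(epoch + 1, epochs, loss.item()))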

Epoch 4/20: loss 7.815014362335205
Epoch 8/20: loss 4.476006984710693
Epoch 12/20: loss 7.489941120147705
Epoch 16/20: loss 2.0759294033050537
Epoch 20/20: loss 8.096097946166992
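The loss in the log above is that of the last mini-batch in each reported epoch, which is why it jumps around (for example 2.08 at epoch 16 and 8.10 at epoch 20). A smoother signal is the mean loss over the whole epoch. The sketch below shows one way to track it; it uses synthetic data and a single linear layer rather than the Auto MPG model, and the learning rate is an arbitrary choice for this toy problem.

```python
import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Synthetic regression data (not Auto MPG): y = 2*x0 - x1 + small noise
X = torch.randn(256, 2)
y = 2 * X[:, 0] - X[:, 1] + 0.1 * torch.randn(256)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(2, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.05, momentum=0.9)

epoch_losses = []
for epoch in range(20):
    running, seen = 0.0, 0
    for feats, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(feats).view(-1), labels)
        loss.backward()
        optimizer.step()
        running += loss.item() * feats.size(0)  # weight each batch by its size
        seen += feats.size(0)
    epoch_losses.append(running / seen)  # mean loss over the epoch

print(epoch_losses[0], epoch_losses[-1])
```

Weighting by batch size keeps the average correct when the last batch is smaller than the others.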


# Test

In [55]:
preds = torch.empty((0,))
gts = torch.empty((0,))

with torch.no_grad():  # gradients are not needed during testing, which saves memory and time
    for feats, labels in test_loader:
        y = model(feats)
        # Append this batch's predictions and ground truth to the running 1-D tensors with torch.cat()
        preds = torch.cat((preds, y), dim=0)
        gts = torch.cat((gts, labels), dim=0)
    err = criterion(preds, gts)
    print('Testing set MSE:{:5.2f} MPG^2'.format(err.item()))

Testing set MSE: 6.25 MPG^2
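MSE is measured in squared units of the target, so it is often easier to interpret the root mean squared error (RMSE) or the mean absolute error (MAE), both of which are in MPG. A test MSE of 6.25, for instance, corresponds to an RMSE of 2.5 MPG. The sketch below computes all three on toy tensors; the numbers are illustrative and are not model outputs.

```python
import torch

# Toy predictions and ground truth in MPG; values are illustrative only.
preds = torch.tensor([20.0, 30.0, 25.0])
gts = torch.tensor([22.0, 27.0, 25.0])

mse = torch.mean((preds - gts) ** 2)      # squared MPG
rmse = torch.sqrt(mse)                    # MPG
mae = torch.mean(torch.abs(preds - gts))  # MPG

print(mse.item(), rmse.item(), mae.item())
```

The same three lines can be applied to the `preds` and `gts` tensors collected in the test loop.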
