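## About the dataset
California housing data. Geographically close locations are grouped together, and each row summarizes the statistics of one group.  
This notebook uses `MedInc` as the prediction target and the remaining seven columns as features. The dataset's own target, the median house value `MedHouseVal`, is not part of `.data` and is not used here.
- MedInc: median income in the group
- HouseAge: median house age in the group
- AveRooms: average number of rooms per household
- AveBedrms: average number of bedrooms per household
- Population: population of the group
- AveOccup: average number of household members
- Latitude: latitude of the group
- Longitude: longitude of the group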

In [1]:
import pandas as pd
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).data
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


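## Preprocessing

Every column is rescaled to the range [0, 1] with `MinMaxScaler`. The network built later uses ReLU as its activation function, so non-negative values are preferred here. ReLU does not strictly require non-negative inputs, but putting all columns on a common scale also tends to make gradient-based training more stable. Note that the label column `MedInc` is scaled as well, so the losses reported later are in scaled units.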

In [2]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df)
df = pd.DataFrame(scaler.transform(df), columns=df.columns)
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.539668 | 0.784314 | 0.043512 | 0.020469 | 0.008941 | 0.001499 | 0.567481 | 0.211155 |
| 1 | 0.538027 | 0.392157 | 0.038224 | 0.018929 | 0.06721 | 0.001141 | 0.565356 | 0.212151 |
| 2 | 0.466028 | 1.0 | 0.052756 | 0.02194 | 0.013818 | 0.001698 | 0.564293 | 0.210159 |
| 3 | 0.354699 | 1.0 | 0.035241 | 0.021929 | 0.015555 | 0.001493 | 0.564293 | 0.209163 |
| 4 | 0.230776 | 1.0 | 0.038534 | 0.022166 | 0.015752 | 0.001198 | 0.564293 | 0.209163 |
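
As a sanity check on what `MinMaxScaler` computes, the sketch below applies x' = (x - min) / (max - min) by hand to the first few `HouseAge` values. The column minimum of 1 and maximum of 52 are assumptions inferred from the scaled output above (52.0 maps to 1.0), not values read from the fitted scaler, and `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(values, lo, hi):
    """Map each value linearly so that lo -> 0.0 and hi -> 1.0."""
    return [(v - lo) / (hi - lo) for v in values]

# First three HouseAge values from df.head() before scaling.
house_age = [41.0, 21.0, 52.0]

# Assumed column minimum and maximum (inferred from the scaled table).
scaled = min_max_scale(house_age, lo=1.0, hi=52.0)
print([round(v, 6) for v in scaled])  # [0.784314, 0.392157, 1.0]
```

The results agree with the `HouseAge` column of the scaled table. One caveat: the scaler in this notebook is fitted on the full DataFrame before the train/test split, so test-set statistics influence the scaling. Fitting on the training rows only would be the stricter alternative.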


## Parameters used

In [3]:
batch_size = 64
learning_rate = 0.01

## Creating the Dataset
A custom Dataset needs the following three methods.
- `__init__`: loads the data.
- `__len__`: returns the number of samples in the dataset.
- `__getitem__`: takes an index and returns a (features, label) pair.

In [4]:
import torch
from torch.utils.data import Dataset

class HousingDataset(Dataset):
    def __init__(self, df: pd.DataFrame):
        # Features: every column except MedInc; label: MedInc
        self.features = torch.tensor(df.drop("MedInc", axis=1).values, dtype=torch.float32)
        self.labels = torch.tensor(df["MedInc"].values, dtype=torch.float32)

    def __len__(self) -> int:
        return len(self.features)

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        return self.features[idx], self.labels[idx]

In [5]:
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader

# Split into train and test sets
train, test = train_test_split(df, random_state=42, shuffle=True)
training_data = HousingDataset(train)
test_data = HousingDataset(test)

train_dataloader = DataLoader(training_data, batch_size=batch_size, shuffle=True)
# Shuffling is not needed for evaluation
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=False)
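
To see what a `DataLoader` hands to the training loop, the following sketch iterates over a toy dataset. `TensorDataset` with all-zero tensors is only a stand-in for `HousingDataset`, and the sizes (10 samples, batch size 4) are chosen for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for HousingDataset: 10 samples, 7 features, one scalar label each.
features = torch.zeros(10, 7)
labels = torch.zeros(10)
toy_loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=False)

# The last batch is smaller because 10 is not a multiple of 4.
for X, y in toy_loader:
    print(tuple(X.shape), tuple(y.shape))

# Labels arrive with shape (batch,), while a model ending in nn.Linear(..., 1)
# returns (batch, 1); unsqueeze(1) aligns the two shapes.
X, y = next(iter(toy_loader))
print(tuple(y.unsqueeze(1).shape))
```

This is why the loops later in the notebook call `y.unsqueeze(1)` before computing the loss: without it, a `(batch, 1)` prediction and a `(batch,)` label would be broadcast against each other instead of being compared element by element.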

## Implementing the neural network model
- Inherit from `nn.Module`
- Do not forget to call `super().__init__()`
- Define the forward pass in a method named `forward`

In [6]:
from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self, input_dim: int, output_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, output_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)

In [7]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = NeuralNetwork(7, 1).to(device)
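
As a rough check of the model size, the sketch below rebuilds the same layer stack with `nn.Sequential` directly and counts the trainable parameters. The variable names are illustrative only.

```python
from torch import nn

# Same layer sizes as NeuralNetwork(7, 1): 7 inputs -> 64 hidden units -> 1 output.
mlp = nn.Sequential(nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 1))

n_params = sum(p.numel() for p in mlp.parameters())
# (7 * 64 weights + 64 biases) + (64 * 1 weights + 1 bias)
expected = (7 * 64 + 64) + (64 * 1 + 1)
print(n_params, expected)  # 577 577
```

The input dimension is 7 because `MedInc` is removed from the eight columns and used as the label.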

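## Optimizing the parameters
Each update consists of the following three phases.
- `optimizer.zero_grad()`: resets the gradients of the parameters
- `loss.backward()`: runs backpropagation through the computation graph
- `optimizer.step()`: updates the parameters according to the gradients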
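
The three phases above can be traced by hand on a one-parameter model. This is a minimal sketch, assuming a made-up model `pred = w * x` with a single data point; it is not part of the housing pipeline.

```python
import torch

# One trainable parameter w and one data point (x, y); the model is pred = w * x.
w = torch.tensor(1.0, requires_grad=True)
x, y = torch.tensor(2.0), torch.tensor(6.0)
optimizer = torch.optim.SGD([w], lr=0.01)

loss = (w * x - y) ** 2   # (1 * 2 - 6)^2 = 16

optimizer.zero_grad()     # clear any gradient left over from a previous step
loss.backward()           # d(loss)/dw = 2 * (w * x - y) * x = -16
grad = w.grad.item()
optimizer.step()          # w <- w - lr * grad = 1.0 + 0.16

print(loss.item(), grad, w.item())
```

Gradients accumulate across `backward()` calls in PyTorch, which is why `zero_grad()` is called before each backward pass.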

In [8]:
import matplotlib.pyplot as plt
%matplotlib inline

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)

    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Forward pass; labels are reshaped from (batch,) to (batch, 1) to match pred
        pred = model(X)
        loss = loss_fn(pred, y.unsqueeze(1))

        # Backward pass and parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Log progress every 100 batches
        if batch % 100 == 0:
            current = batch * dataloader.batch_size + len(X)
            print(f"loss: {loss.item():>5f} [{current:>5d}/{size:>5d}]")
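
The progress counter printed by `train_loop` can be reproduced with plain arithmetic. The sketch assumes the 15480 training rows that appear in the log output and the batch size of 64 set earlier.

```python
import math

n_train = 15480   # training rows, as reported in the log output
batch_size = 64   # same value as in the parameter cell

n_batches = math.ceil(n_train / batch_size)
# Batches whose index is a multiple of 100 are logged.
logged = [b for b in range(n_batches) if b % 100 == 0]
# Samples seen so far; every logged batch here is a full batch.
progress = [b * batch_size + batch_size for b in logged]

print(n_batches, logged, progress)  # 242 [0, 100, 200] [64, 6464, 12864]
```

This accounts for the three log lines per epoch and their counters of 64, 6464, and 12864.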

In [9]:
def test_loop(dataloader, model, loss_fn):
    model.eval()
    test_loss = 0.0

    # Parameters are not updated during testing, so gradient tracking is disabled
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y.unsqueeze(1)).item()

    # Average of the per-batch mean losses
    test_loss /= len(dataloader)
    print(f"Average loss: {test_loss}")
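
Dividing by `len(dataloader)` averages the per-batch means, which differs slightly from the per-sample mean when the last batch is smaller than the others. The numbers below are hypothetical and only illustrate the size of the effect.

```python
# Hypothetical per-batch mean losses and batch sizes (the last batch is smaller).
batch_losses = [0.02, 0.04]
batch_sizes = [64, 16]

# What test_loop reports: every batch counts equally.
mean_of_batch_means = sum(batch_losses) / len(batch_losses)

# Per-sample mean: each batch is weighted by its number of samples.
per_sample_mean = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / sum(batch_sizes)

print(round(mean_of_batch_means, 6), round(per_sample_mean, 6))  # 0.03 0.024
```

With many full batches and a single short one, the difference is usually small, so the simpler average is often acceptable for monitoring.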

In [10]:
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for epoch in range(10):
    print(f"Epoch {epoch + 1}\n------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)

Epoch 1
------------------
loss: 0.031597 [   64/15480]
loss: 0.018169 [ 6464/15480]
loss: 0.022313 [12864/15480]
Average loss: 0.01721460001980081
Epoch 2
------------------
loss: 0.019121 [   64/15480]
loss: 0.011099 [ 6464/15480]
loss: 0.014656 [12864/15480]
Average loss: 0.017035149405767887
Epoch 3
------------------
loss: 0.008281 [   64/15480]
loss: 0.014548 [ 6464/15480]
loss: 0.018449 [12864/15480]
Average loss: 0.016864555014044415
Epoch 4
------------------
loss: 0.019899 [   64/15480]
loss: 0.020773 [ 6464/15480]
loss: 0.012887 [12864/15480]
Average loss: 0.016803205731888243
Epoch 5
------------------
loss: 0.016843 [   64/15480]
loss: 0.011712 [ 6464/15480]
loss: 0.024456 [12864/15480]
Average loss: 0.016677577709664167
Epoch 6
------------------
loss: 0.012317 [   64/15480]
loss: 0.018188 [ 6464/15480]
loss: 0.019419 [12864/15480]
Average loss: 0.016675502809201492
Epoch 7
------------------
loss: 0.027372 [   64/15480]
loss: 0.017025 [ 6464/15480]
loss: 0.008660 [12864/