# Lab 4-2: Load Data

Author: Seungjae Lee (이승재)

<div class="alert alert-warning">
    We use elemental PyTorch to implement linear regression here. However, in most actual applications, abstractions such as <code>nn.Module</code> or <code>nn.Linear</code> are used.
</div>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
cd drive/My Drive/class20211/PyTorch

/content/drive/My Drive/class20211/PyTorch


## Loading Data from `.csv` file

In [3]:
import numpy as np

In [4]:
xy = np.loadtxt('data-01-test-score.csv', delimiter=',', dtype=np.float32)

In [5]:
x_data = xy[:, 0:-1]
y_data = xy[:, [-1]]

In [6]:
print(x_data.shape) # x_data shape
print(len(x_data))  # number of samples in x_data
print(x_data[:5])   # first five rows

(25, 3)
25
[[ 73.  80.  75.]
 [ 93.  88.  93.]
 [ 89.  91.  90.]
 [ 96.  98. 100.]
 [ 73.  66.  70.]]


In [7]:
print(y_data.shape) # y_data shape
print(len(y_data))  # number of samples in y_data
print(y_data[:5])   # first five rows

(25, 1)
25
[[152.]
 [185.]
 [180.]
 [196.]
 [142.]]
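
The two slices above differ in one detail: `xy[:, [-1]]` indexes with a list, so the result stays two-dimensional, whereas `xy[:, -1]` would drop the column axis. A minimal sketch on a two-row stand-in array (`demo`, `features`, `targets_2d`, and `targets_1d` are illustrative names, not part of the lab):

```python
import numpy as np

# Two rows in the same layout as the CSV: three scores, then the target.
demo = np.array([[73, 80, 75, 152],
                 [93, 88, 93, 185]], dtype=np.float32)

features = demo[:, 0:-1]    # every column except the last
targets_2d = demo[:, [-1]]  # list index keeps the column axis
targets_1d = demo[:, -1]    # integer index drops the column axis

print(features.shape, targets_2d.shape, targets_1d.shape)
# (2, 3) (2, 1) (2,)
```

Keeping the target as `(N, 1)` matches the `(N, 1)` output of a model ending in `nn.Linear(3, 1)`, which avoids unintended broadcasting when the two are compared in a loss function.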


## Mini-batch

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [9]:
# For reproducibility
torch.manual_seed(1)

<torch._C.Generator at 0x7f46528355b8>

In [10]:
from torch.utils.data import Dataset

In [11]:
class CustomDataset(Dataset):
    def __init__(self):
        self.x_data = torch.FloatTensor([[73, 80, 75],
                                          [93, 88, 93],
                                          [89, 91, 90],
                                          [96, 98, 100],
                                          [73, 66, 70]])
        self.y_data = torch.FloatTensor([[152], [185], [180], [196], [142]])

    def __len__(self):
        return len(self.x_data)

    def __getitem__(self, idx):
        # x_data and y_data are already float tensors, so index them directly
        x = self.x_data[idx]
        y = self.y_data[idx]

        return x, y

dataset = CustomDataset()


In [12]:
from torch.utils.data import DataLoader

In [13]:
dataloader = DataLoader(
    dataset,
    batch_size = 2,
    shuffle = True,
)
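
To see what a loader configured like this yields, the sketch below iterates over one pass of the same five samples. It uses `TensorDataset` as a stand-in for `CustomDataset` so that the block is self-contained; `demo_dataset`, `demo_loader`, and `batch_sizes` are illustrative names.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

x = torch.tensor([[73., 80., 75.],
                  [93., 88., 93.],
                  [89., 91., 90.],
                  [96., 98., 100.],
                  [73., 66., 70.]])
y = torch.tensor([[152.], [185.], [180.], [196.], [142.]])

# TensorDataset pairs x[i] with y[i], much like CustomDataset.__getitem__.
demo_dataset = TensorDataset(x, y)
demo_loader = DataLoader(demo_dataset, batch_size=2, shuffle=True)

batch_sizes = []
for batch_idx, (xb, yb) in enumerate(demo_loader):
    batch_sizes.append(xb.shape[0])
    print(batch_idx, tuple(xb.shape), tuple(yb.shape))

# Five samples with batch_size=2 give three batches; the last holds one sample.
print(len(demo_loader), batch_sizes)
```

With `shuffle=True` the sample order changes on every pass, but the batch sizes stay 2, 2, 1. Passing `drop_last=True` to `DataLoader` would discard the incomplete final batch instead.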

In [14]:
class MultivariateLinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(3, 1)

    def forward(self, x):
        return self.linear(x)

In [16]:
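# Data (these full-dataset tensors are overwritten by each mini-batch in the loop below)
x_train = torch.FloatTensor(x_data)
y_train = torch.FloatTensor(y_data)
# Initialize the model
model = MultivariateLinearRegressionModel()
# Set up the optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-5)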

nb_epochs = 20

for epoch in range(nb_epochs+1):

    for batch_idx, samples in enumerate(dataloader):
        x_train, y_train = samples 
    
        # Compute H(x)
        prediction = model(x_train)

        # Compute the cost
        cost = F.mse_loss(prediction, y_train)

        # Improve H(x) using the cost
        optimizer.zero_grad()
        cost.backward()
        optimizer.step()

        # Log every batch
        print('Epoch {:4d}/{} Batch {}/{} Cost: {:.6f}'.format(
            epoch, nb_epochs, batch_idx+1, len(dataloader), cost.item()
        ))

Epoch    0/20 Batch 1/3 Cost: 28494.527344
Epoch    0/20 Batch 2/3 Cost: 10967.986328
Epoch    0/20 Batch 3/3 Cost: 5100.812988
Epoch    1/20 Batch 1/3 Cost: 682.938171
Epoch    1/20 Batch 2/3 Cost: 230.594070
Epoch    1/20 Batch 3/3 Cost: 91.501534
Epoch    2/20 Batch 1/3 Cost: 22.564404
Epoch    2/20 Batch 2/3 Cost: 5.264287
Epoch    2/20 Batch 3/3 Cost: 0.666594
Epoch    3/20 Batch 1/3 Cost: 0.254771
Epoch    3/20 Batch 2/3 Cost: 0.412526
Epoch    3/20 Batch 3/3 Cost: 1.510413
Epoch    4/20 Batch 1/3 Cost: 0.508034
Epoch    4/20 Batch 2/3 Cost: 0.028077
Epoch    4/20 Batch 3/3 Cost: 0.157345
Epoch    5/20 Batch 1/3 Cost: 0.537081
Epoch    5/20 Batch 2/3 Cost: 0.217656
Epoch    5/20 Batch 3/3 Cost: 0.058559
Epoch    6/20 Batch 1/3 Cost: 0.043198
Epoch    6/20 Batch 2/3 Cost: 0.588489
Epoch    6/20 Batch 3/3 Cost: 0.011543
Epoch    7/20 Batch 1/3 Cost: 0.121354
Epoch    7/20 Batch 2/3 Cost: 0.603847
Epoch    7/20 Batch 3/3 Cost: 0.005891
Epoch    8/20 Batch 1/3 Cost: 0.439329
Epoch   

### Without Batch

In [17]:
# Data
x_train = torch.FloatTensor(x_data)
y_train = torch.FloatTensor(y_data)
# Initialize the model
model = MultivariateLinearRegressionModel()
# Set up the optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-5)

nb_epochs = 20
for epoch in range(nb_epochs+1): 
    
    # Compute H(x)
    prediction = model(x_train)

    # Compute the cost
    cost = F.mse_loss(prediction, y_train)

    # Improve H(x) using the cost
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()

    # Log every epoch
    print('Epoch {:4d}/{} Cost: {:.6f}'.format(
        epoch, nb_epochs, cost.item()
    ))

Epoch    0/20 Cost: 9201.987305
Epoch    1/20 Cost: 3406.534424
Epoch    2/20 Cost: 1263.794556
Epoch    3/20 Cost: 471.564056
Epoch    4/20 Cost: 178.654236
Epoch    5/20 Cost: 70.357231
Epoch    6/20 Cost: 30.316614
Epoch    7/20 Cost: 15.512158
Epoch    8/20 Cost: 10.038234
Epoch    9/20 Cost: 8.014143
Epoch   10/20 Cost: 7.265511
Epoch   11/20 Cost: 6.988441
Epoch   12/20 Cost: 6.885746
Epoch   13/20 Cost: 6.847506
Epoch   14/20 Cost: 6.833104
Epoch   15/20 Cost: 6.827510
Epoch   16/20 Cost: 6.825186
Epoch   17/20 Cost: 6.824053
Epoch   18/20 Cost: 6.823368
Epoch   19/20 Cost: 6.822850
Epoch   20/20 Cost: 6.822397


## Dataset and DataLoader

<div class="alert alert-warning">
    Basic knowledge of pandas will probably be needed here.
</div>

When the data is too large to fit in memory, we cannot load all of `x_data` and `y_data` at once; we have to fetch only the batch we need at each step.
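
One way to fetch only what each batch needs is a `Dataset` whose `__getitem__` reads a single row from disk on demand. The sketch below is a minimal illustration under stated assumptions, not the lab's method: `LazyCsvDataset` and the file name `scores_demo.csv` are hypothetical, and the file is a five-row stand-in written to a temporary directory. The dataset keeps only the byte offset of each line in memory.

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset, DataLoader


class LazyCsvDataset(Dataset):
    """Hypothetical dataset that reads one CSV row from disk per __getitem__ call."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        # Scan the file once and remember where each non-empty line starts.
        with open(path, "rb") as f:
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if line.strip():
                    self.offsets.append(pos)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Jump straight to the requested row and parse only that line.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            values = [float(v) for v in f.readline().decode().split(",")]
        x = torch.tensor(values[:-1], dtype=torch.float32)  # features
        y = torch.tensor(values[-1:], dtype=torch.float32)  # target, shape (1,)
        return x, y


# Stand-in file holding the five sample rows used earlier in this lab.
rows = ["73,80,75,152", "93,88,93,185", "89,91,90,180",
        "96,98,100,196", "73,66,70,142"]
csv_path = os.path.join(tempfile.mkdtemp(), "scores_demo.csv")
with open(csv_path, "w") as f:
    f.write("\n".join(rows) + "\n")

lazy_dataset = LazyCsvDataset(csv_path)
lazy_loader = DataLoader(lazy_dataset, batch_size=2, shuffle=False)

first_x, first_y = next(iter(lazy_loader))
print(len(lazy_dataset), tuple(first_x.shape), tuple(first_y.shape))
# 5 (2, 3) (2, 1)
```

Because `DataLoader` only ever calls `__len__` and `__getitem__`, the training loop does not need to change when the in-memory dataset is swapped for a lazy one like this.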

[PyTorch Data Loading and Processing tutorial](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#iterating-through-the-dataset)