# Downloading Samsung Electronics stock price data
- Download the stock price data from Yahoo Finance (https://finance.yahoo.com/)
    - Enter the search keyword '005930.KS'
- After searching, select Historical Data

![yahoo finance](figures/rnn/21_yahoo_stock1.png)

- Select `Start Date: January 4, 2000, End Date: today's date`
- Click the **Apply button**, then download
  
![yahoo finance](figures/rnn/22_yahoo_stock2.png)

In [1]:
import os
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import torchinfo

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split  

device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cpu'

# Data loading

In [2]:
df = pd.read_csv("dataset/005930.KS.csv")
df.shape

(6122, 7)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6122 entries, 0 to 6121
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       6122 non-null   object 
 1   Open       6122 non-null   float64
 2   High       6122 non-null   float64
 3   Low        6122 non-null   float64
 4   Close      6122 non-null   float64
 5   Adj Close  6122 non-null   float64
 6   Volume     6122 non-null   int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 334.9+ KB


In [4]:
# Set Date as the index
df = df.set_index('Date')
df.head()

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2000-01-04 | 6000.0 | 6110.0 | 5660.0 | 6110.0 | 4449.711426 | 74195000 |
| 2000-01-05 | 5800.0 | 6060.0 | 5520.0 | 5580.0 | 4063.72876 | 74680000 |
| 2000-01-06 | 5750.0 | 5780.0 | 5580.0 | 5620.0 | 4092.860107 | 54390000 |
| 2000-01-07 | 5560.0 | 5670.0 | 5360.0 | 5540.0 | 4034.597656 | 40305000 |
| 2000-01-10 | 5600.0 | 5770.0 | 5580.0 | 5770.0 | 4202.100586 | 46880000 |


In [6]:
df.drop(columns='Adj Close', inplace=True)

In [7]:
df.head()

| Date | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|
| 2000-01-04 | 6000.0 | 6110.0 | 5660.0 | 6110.0 | 74195000 |
| 2000-01-05 | 5800.0 | 6060.0 | 5520.0 | 5580.0 | 74680000 |
| 2000-01-06 | 5750.0 | 5780.0 | 5580.0 | 5620.0 | 54390000 |
| 2000-01-07 | 5560.0 | 5670.0 | 5360.0 | 5540.0 | 40305000 |
| 2000-01-10 | 5600.0 | 5770.0 | 5580.0 | 5770.0 | 46880000 |


# Dataset construction
## input, output data
- input (X) features: \[Open, High, Low, Close, Volume\] (Adj Close excluded), 50 days' worth
- output (y): Close, the closing price on the day after the input window

In [9]:
df_y = df['Close'].to_frame() 
df_X = df
print(df_X.shape, df_y.shape)

(6122, 5) (6122, 1)


## Preprocessing
- feature scaling
    - Brings the features onto a common scale (unit).
- X: Standard scaling (mean: 0, standard deviation: 1)
- y: MinMax scaling (min: 0, max: 1)  => converts y to values on a scale similar to X.

In [17]:
X_scaler = StandardScaler()
y_scaler = MinMaxScaler()
X = X_scaler.fit_transform(df_X)
y = y_scaler.fit_transform(df_y)
print(X.shape, y.shape)

(6122, 5) (6122, 1)


In [18]:
X[:5]

array([[-0.99007851, -0.98965518, -1.00112858, -0.98571511,  3.51992614],
       [-0.99894601, -0.99185497, -1.00738983, -1.00923676,  3.55212512],
       [-1.00116289, -1.00417382, -1.00470644, -1.00746154,  2.20507915],
       [-1.00958702, -1.00901336, -1.01454554, -1.01101197,  1.26998095],
       [-1.00781352, -1.00461378, -1.00470644, -1.00080447,  1.70649289]])

In [19]:
y[:5]

array([[0.03829161],
       [0.0322873 ],
       [0.03274046],
       [0.03183415],
       [0.03443979]])

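## Building the input sequence data
- X: 50 days of data (e.g., day 1 to day 50), y: the stock price on day 51.
    - The model learns from 50 consecutive days of stock data and predicts the price on day 51.
    - A single sample of X holds 50 days of stock data.

![img](figures/rnn/20_stock_dataset.png)

[Example with a window of 5 consecutive days]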
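The windowing can be tried on a toy array with a window of 5 days. This is a minimal sketch with made-up data (`toy_X`, `toy_y` are hypothetical); it builds the windows with an explicit loop and, as an optional alternative, with NumPy's `sliding_window_view` (available in NumPy 1.20 and later), then confirms both give the same result.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

time_steps = 5                 # window length used in this toy example
n_days, n_features = 8, 2
toy_X = np.arange(n_days * n_features, dtype=float).reshape(n_days, n_features)
toy_y = np.arange(n_days, dtype=float).reshape(n_days, 1)

# Loop version: window i covers days i .. i+4, target is day i+5.
loop_X = np.array([toy_X[i:i + time_steps] for i in range(n_days - time_steps)])
loop_y = np.array([toy_y[i + time_steps] for i in range(n_days - time_steps)])

# Vectorized version: windows along axis 0 come out as (n_windows, features, time_steps).
windows = sliding_window_view(toy_X, time_steps, axis=0)
vec_X = windows[:-1].transpose(0, 2, 1)  # drop the last window (it has no next-day target)
vec_y = toy_y[time_steps:]

print(loop_X.shape, loop_y.shape)
print(np.array_equal(loop_X, vec_X), np.array_equal(loop_y, vec_y))
```

With 8 days and a window of 5, only 3 samples can be formed, because each sample needs 5 input days plus 1 target day.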

In [20]:
time_steps = 50  # seq_length (how many days of stock data to group into one sample)
data_X = []  # list collecting the input samples; shape of one X: (50 (time_steps), 5)
data_y = []  # list collecting the output values

for idx in range(0, y.size - time_steps):  # repeat while the 51 rows needed for one sample remain
    # idx: 0  X: 0 ~ 49,   y: 50
    # idx: 1  X: 1 ~ 50,   y: 51
    _X = X[idx:time_steps+idx]
    _y = y[time_steps+idx]
    data_X.append(_X)
    data_y.append(_y)

In [25]:
np.shape(data_X) # (6072: number of samples (batch), 50: seq_len, 5: number of features per time step)

(6072, 50, 5)

In [26]:
np.shape(data_y)

(6072, 1)

## Train / test set split

In [27]:
X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.2)

In [28]:
np.shape(X_train), np.shape(X_test)

((4857, 50, 5), (1215, 50, 5))

In [31]:
# Convert list -> ndarray (creating a Tensor directly from a list is slow.)
X_train, X_test, y_train, y_test = np.array(X_train), np.array(X_test), np.array(y_train), np.array(y_test)

## Building the Dataset and DataLoader

In [33]:
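# Dataset
## When the raw data are tensors already in memory, build the dataset with TensorDataset.
## The NumPy arrays are float64; cast to float32 to match the default dtype of nn.LSTM weights.
train_set = TensorDataset(torch.tensor(X_train, dtype=torch.float32), torch.tensor(y_train, dtype=torch.float32))
test_set = TensorDataset(torch.tensor(X_test, dtype=torch.float32), torch.tensor(y_test, dtype=torch.float32))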

len(train_set), len(test_set)

(4857, 1215)

In [34]:
train_loader = DataLoader(train_set, batch_size=200, shuffle=True, drop_last=True)
test_loader = DataLoader(test_set, batch_size=200)

len(train_loader), len(test_loader)

(24, 7)
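The batch counts `(24, 7)` follow from the sample counts and `drop_last`. The sketch below reproduces them with dummy zero tensors (placeholders, not the real data): with `drop_last=True` the incomplete final batch is discarded, so the count is a floor division; without it, the count is rounded up.

```python
import math

import torch
from torch.utils.data import TensorDataset, DataLoader

# Dummy tensors with the same number of samples as the notebook's train/test sets.
n_train, n_test, batch_size = 4857, 1215, 200
train_set = TensorDataset(torch.zeros(n_train, 1), torch.zeros(n_train, 1))
test_set = TensorDataset(torch.zeros(n_test, 1), torch.zeros(n_test, 1))

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(test_set, batch_size=batch_size)

# drop_last=True  -> floor(4857 / 200) batches; the leftover 57 samples are skipped each epoch.
# drop_last=False -> ceil(1215 / 200) batches; the last batch holds only 15 samples.
print(len(train_loader), n_train // batch_size)
print(len(test_loader), math.ceil(n_test / batch_size))
```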

# Model definition

In [None]:
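class StockPriceModel(nn.Module):

    def __init__(self, input_size, hidden_size, num_layers, bidirectional=True, dropout_rate=0.3):
        super().__init__()
        # X -> LSTM -(last hidden state)-> Dropout -> Linear -> Sigmoid -> y
        self.bidirectional = bidirectional
        self.lstm = nn.LSTM(
            input_size=input_size,  # number of features per time step (one day of X)
            hidden_size=hidden_size,
            num_layers=num_layers,
            bidirectional=bidirectional,
            dropout=dropout_rate,  # applied between stacked LSTM layers
            batch_first=True  # X arrives as (batch, seq_len, feature)
        )
        self.dropout = nn.Dropout(dropout_rate)
        i_features = hidden_size * 2 if bidirectional else hidden_size
        self.lr = nn.Linear(i_features,  1)  # output: a single price
        self.sigmoid = nn.Sigmoid()  # y is in the 0 ~ 1 range, so sigmoid keeps the output in that range

    def forward(self, X):
        # One possible forward pass following the pipeline noted in __init__.
        # hidden: (num_layers * num_directions, batch, hidden_size)
        _, (hidden, _) = self.lstm(X)
        if self.bidirectional:
            # last layer: hidden[-2] is the forward direction, hidden[-1] the backward direction
            last_hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)
        else:
            last_hidden = hidden[-1]
        return self.sigmoid(self.lr(self.dropout(last_hidden)))  # (batch, 1)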
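The `hidden_size * 2` width of the linear layer's input comes from how a bidirectional `nn.LSTM` returns its final hidden states. The shape check below is a small sketch with arbitrary sizes chosen for illustration (batch of 4, hidden size 16, 2 layers); only the feature count 5 and sequence length 50 mirror the notebook.

```python
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size, num_layers = 4, 50, 5, 16, 2
lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers,
               bidirectional=True, batch_first=True)

x = torch.randn(batch, seq_len, input_size)
out, (hidden, cell) = lstm(x)

# out:    (batch, seq_len, 2 * hidden_size), per-step outputs of the last layer
# hidden: (num_layers * 2, batch, hidden_size), final state per layer and direction
print(out.shape, hidden.shape)

# The last layer's forward and backward final states are the last two entries.
last_hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)
print(last_hidden.shape)  # (batch, 2 * hidden_size), the input width of the Linear layer
```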
    

### train
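The notebook does not show its training code, so the following is only a sketch of what a training loop for this kind of model could look like. Everything in it is an assumption made for illustration: the stand-in model `TinyLSTMRegressor`, the random data, the MSE loss, the Adam optimizer, the learning rate, and the number of epochs.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

class TinyLSTMRegressor(nn.Module):
    """Hypothetical stand-in model: LSTM -> last hidden state -> Linear -> Sigmoid."""
    def __init__(self, input_size=5, hidden_size=16):
        super().__init__()
        self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, batch_first=True)
        self.lr = nn.Linear(hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, X):
        _, (hidden, _) = self.lstm(X)             # hidden: (1, batch, hidden_size)
        return self.sigmoid(self.lr(hidden[-1]))  # (batch, 1)

# Random stand-in data: 64 windows of 50 days x 5 features, targets in [0, 1).
toy_X = torch.randn(64, 50, 5)
toy_y = torch.rand(64, 1)
loader = DataLoader(TensorDataset(toy_X, toy_y), batch_size=16, shuffle=True)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = TinyLSTMRegressor().to(device)
loss_fn = nn.MSELoss()                                      # assumed regression loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # assumed optimizer and lr

epoch_losses = []
for epoch in range(3):                                      # assumed epoch count
    model.train()
    total = 0.0
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        pred = model(X_batch)
        loss = loss_fn(pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
    epoch_losses.append(total / len(loader))  # mean batch loss for this epoch

print(epoch_losses)
```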

# Inferring the next day's stock price from the most recent data
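The mechanics of this step can be sketched as follows: take the most recent 50 scaled rows as a single input of shape `(1, 50, 5)`, run the model without gradients, and convert the scaled output back to a price with the y scaler. The data here is random and the network is an untrained stand-in (both hypothetical), so the predicted number itself is meaningless; only the shapes and the flow are the point.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler, StandardScaler

torch.manual_seed(0)
rng = np.random.default_rng(0)

time_steps, n_features = 50, 5
# Hypothetical stand-in for the price table: 120 days x 5 features.
raw_X = rng.uniform(5000, 6000, size=(120, n_features))
raw_y = raw_X[:, [3]]  # column 3 plays the role of Close

X_scaler, y_scaler = StandardScaler(), MinMaxScaler()
X_scaled = X_scaler.fit_transform(raw_X)
y_scaler.fit(raw_y)

# Untrained stand-in network (a trained model would be used in practice).
lstm = nn.LSTM(input_size=n_features, hidden_size=8, batch_first=True)
head = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
lstm.eval()
head.eval()

# The most recent 50 days form one sample: (50, 5) -> (1, 50, 5).
last_window = torch.tensor(X_scaled[-time_steps:], dtype=torch.float32).unsqueeze(0)

with torch.no_grad():
    _, (hidden, _) = lstm(last_window)
    pred_scaled = head(hidden[-1])  # (1, 1), a value between 0 and 1

# Map the scaled prediction back to the original price unit.
pred_price = y_scaler.inverse_transform(pred_scaled.numpy())
print(pred_price)
```

Because the output passes through a sigmoid and the y scaler is a MinMax scaler, the recovered price always lies between the minimum and maximum Close seen when the scaler was fitted.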