# Hackathon Dataset

## Input Data
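- The columns of the dataset consist of date information, stock (ticker) information, and a feature set. The feature set is anonymized (blurred).
- Each feature is a data value judged to be meaningful for the stock.
- <strong>Data after T221 is held out as the test set (about 17,000 rows, roughly 20% of the full data).</strong>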

## Target Data
- For the train set, two targets are provided; for the test set, no targets are provided.
- Each target is built from the return over the next unit time (T), measured from the sample's timestamp.
    - Target 1: unit-time (T) return (train_target.csv) -> Regression
    - Target 2: the unit-time returns of all stocks at a given point in time, split into quintiles (train_target2.csv) -> Classification
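To make the relationship between the two targets concrete, here is a minimal sketch of how a per-period quintile label could be derived from a return column. The data is synthetic, the helper `to_quintile` is hypothetical, and the label ordering (0 = lowest returns, 4 = highest) is an assumption; the actual construction of train_target2.csv is not documented here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_target.csv: 3 periods x 10 stocks (made-up values).
rng = np.random.default_rng(0)
periods = ["T001", "T002", "T003"]
codes = [f"A{i:03d}" for i in range(10)]
index = pd.MultiIndex.from_product([periods, codes], names=["td", "code"])
returns = pd.DataFrame({"target": rng.normal(0, 0.05, len(index))}, index=index)

def to_quintile(s: pd.Series) -> pd.Series:
    # Rank first so ties cannot produce duplicate bin edges, then cut into 5 equal bins.
    return pd.qcut(s.rank(method="first"), 5, labels=False)

# Assumption: label 0 is the lowest-return quintile and 4 the highest.
quintiles = returns.groupby(level="td")["target"].transform(to_quintile).astype(int)

# Each period should contribute the same number of stocks to every quintile.
print(quintiles.groupby(level="td").value_counts().unstack())
```

Because the bins are computed within each `td` group, the label describes a stock's position relative to the other stocks at the same point in time, not relative to the whole history.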
    
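### Details
- Depending on which target you train on, you can take a regression approach or a classification approach that predicts which quintile a stock will fall into.
- Beyond these, you may use other approaches at your discretion, such as transforming the data.
    - For example, in the classification setting you may split returns into two groups instead of quintiles and treat the task as binary classification. If you transform the data, please describe what you did.
- You do not need to use all of the data for training.
- The process of building the model matters more than the final model results.
- Please explain in detail the reasoning behind your analysis.
- Please include at least one modeling step that uses machine learning.
    - There is no restriction on using models other than deep learning models.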
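The binary-classification variant mentioned above could look roughly like the following sketch. It uses synthetic returns, and the threshold (each period's median return) is only one possible choice, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic returns: 2 periods x 8 stocks (made-up values).
rng = np.random.default_rng(1)
index = pd.MultiIndex.from_product(
    [["T001", "T002"], [f"A{i:03d}" for i in range(8)]], names=["td", "code"]
)
returns = pd.Series(rng.normal(0, 0.05, len(index)), index=index, name="target")

# Hypothetical binary target: 1 if the stock beats its period's median return, else 0.
period_median = returns.groupby(level="td").transform("median")
binary_target = (returns > period_median).astype(int)

# Share of positive labels per period (balanced by construction).
print(binary_target.groupby(level="td").mean())
```

Splitting at the per-period median keeps the two classes balanced within every period, which mirrors how the quintile target is defined relative to other stocks at the same time.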

In [9]:
import pandas as pd
import numpy as np

X_train = pd.read_csv('./data/train_data.csv')
Y_train = pd.read_csv('./data/train_target.csv')
Y2_train = pd.read_csv('./data/train_target2.csv')
X_train.head()

Unnamed: 0,td,code,F001,F002,F003,F004,F005,F006,F007,F008,...,F037,F038,F039,F040,F041,F042,F043,F044,F045,F046
0,T001,A005,7.267364,0.004896,0.945559,-0.828748,0.641026,-0.038719,0.015282,-1.015634,...,-1.567398,0.007646,0.002793,1.0,0.004724,-1.041667,13.357401,0.793424,11.347518,0.54447
1,T001,A006,-7.477904,-0.000128,1.089255,0.042335,7.640449,0.038965,0.016616,-0.631765,...,-1.033058,-0.001463,-0.002713,2.0,-0.004431,2.040816,-14.464286,0.546866,-4.960317,3.91478
2,T001,A007,7.622525,0.001413,1.260723,0.001667,13.735577,0.02574,0.01253,6.140861,...,7.648485,0.003168,-0.000951,6.0,-0.004544,0.0,13.052749,0.523903,-1.228115,9.910044
3,T001,A011,51.693204,0.0,6.967351,0.268144,-11.543311,0.143675,0.033834,0.401105,...,1.358087,0.037001,-0.004078,2.0,0.012924,-7.142857,156.242771,1.050259,137.679277,-2.97993
4,T001,A012,-7.707446,-0.000763,1.201887,0.285988,21.070234,-0.006894,0.017134,0.497051,...,0.835655,-0.059726,-0.000538,5.0,-4.5e-05,0.0,-17.351598,0.865144,-17.539863,12.087614


In [8]:
!pwd



/Users/hwamoc/Documents/workspcase_jupyter


In [10]:
Y_train.head()

Unnamed: 0,td,code,target
0,T001,A005,-0.041401
1,T001,A006,-0.010438
2,T001,A007,-0.04263
3,T001,A011,0.109743
4,T001,A012,0.058011


In [11]:
X_train = X_train.set_index(['td', 'code'])
Y_train = Y_train.set_index(['td', 'code'])
Y2_train = Y2_train.set_index(['td', 'code'])

In [4]:
X_train.shape, Y_train.shape, Y2_train.shape

((83564, 46), (83564, 1), (83564, 1))

In [12]:
X_train.describe()

Unnamed: 0,F001,F002,F003,F004,F005,F006,F007,F008,F009,F010,...,F037,F038,F039,F040,F041,F042,F043,F044,F045,F046
count,80147.0,72494.0,82138.0,83448.0,83448.0,69456.0,82518.0,82504.0,71432.0,70884.0,...,83448.0,80338.0,71058.0,83383.0,70049.0,67884.0,83448.0,83559.0,83448.0,83102.0
mean,7.89437,0.000915,3.212757,0.202291,2.075079,0.052969,0.024497,1.753955,0.909854,-2.7e-05,...,6.097276,0.003239,0.001637,3.776357,0.003051,0.678024,23.730528,1.000124,17.703937,0.49589
std,22.1013,0.03923,17.267118,2.020238,14.543002,0.069505,0.011561,8.342232,0.601707,0.016406,...,29.899738,0.328621,0.025129,1.994943,0.034708,8.502843,76.855077,0.269099,61.836535,4.649171
min,-52.092279,-3.832419,0.169382,-76.84,-59.385189,-0.866186,0.00569,-26.092628,-0.165837,-0.367137,...,-70.21978,-43.672986,-0.727923,1.0,-0.727923,-92.5,-92.017722,0.013118,-89.84329,-27.129616
25%,-6.103297,-0.001195,0.919447,-0.02,-5.553225,0.020651,0.017816,-4.073285,0.440529,-0.001739,...,-8.556635,-0.006791,-0.002727,2.0,-0.0054,-1.198204,-12.293946,0.864735,-11.29578,-2.745036
50%,3.184114,0.0,1.551248,0.0,0.440529,0.042782,0.022341,0.917061,0.8,0.0,...,1.724138,0.002463,0.0,4.0,0.0,0.0,5.0,0.985998,4.21326,0.234027
75%,16.197738,0.0004,3.090994,0.059686,7.606289,0.073355,0.028253,6.597034,1.234568,0.001397,...,14.653135,0.014523,0.002716,5.0,0.00732,2.083333,36.567964,1.102857,29.559748,3.478132
max,470.723992,4.133958,983.606913,345.8,691.925065,1.670955,0.383658,109.968661,10.0,2.307626,...,1563.043478,5.31708,1.160028,7.0,1.160612,207.692308,3125.524476,21.820324,2900.0,56.84502


In [13]:
# imputation with -1
X_train.fillna(-1, inplace = True)
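One caveat with filling by -1: according to the summary statistics above, many features take values well below and above -1 (F001, for example, ranges from about -52 to 471), so the fill value cannot be told apart from real observations. As an alternative, here is a minimal sketch using scikit-learn's `SimpleImputer` with median filling plus missing-value indicator columns. The small frame and the `_missing` column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up frame with gaps, standing in for X_train.
X = pd.DataFrame({
    "F001": [7.27, np.nan, 7.62, 51.69],
    "F002": [0.0049, -0.0001, np.nan, 0.0],
    "F003": [0.95, 1.09, 1.26, 6.97],
})

# Fill each column with its median and append one 0/1 indicator
# column per feature that had missing values during fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)
values = imputer.fit_transform(X)

# Hypothetical naming scheme for the indicator columns.
indicator_names = [f"{X.columns[i]}_missing" for i in imputer.indicator_.features_]
X_imputed = pd.DataFrame(values, columns=list(X.columns) + indicator_names, index=X.index)
print(X_imputed)
```

The indicator columns let a model learn from the fact that a value was missing, which the plain -1 fill only allows when -1 is outside a feature's normal range.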

## Modeling
### ExtraTreesClassifier Example

In [14]:
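# Use the classification target (Y2_train) in this example
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier(n_estimators=100, max_depth=20)
# For regression, train a regressor on Y_train instead.
# ravel() flattens the (n, 1) target to the 1-D shape scikit-learn expects.
model.fit(X_train.values, Y2_train.values.ravel())

# Accuracy on the training rows themselves (not a held-out estimate)
print(model.score(X_train.values, Y2_train.values.ravel()))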



0.8009070891771576
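The score above is measured on the same rows the model was fitted on, so it says little about performance on later periods. Since the test set is split off by time, one option is to hold out the most recent training periods for validation. The sketch below shows the idea on synthetic data with random labels; the 80/20 cutoff and all names are assumptions, not part of the provided notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in: 10 periods x 50 stocks, 5 features, random quintile labels.
rng = np.random.default_rng(0)
periods = [f"T{i:03d}" for i in range(1, 11)]
codes = [f"A{i:03d}" for i in range(50)]
index = pd.MultiIndex.from_product([periods, codes], names=["td", "code"])
X = pd.DataFrame(rng.normal(size=(len(index), 5)), index=index,
                 columns=[f"F{i:03d}" for i in range(1, 6)])
y = pd.Series(rng.integers(0, 5, len(index)), index=index, name="target")

# Hold out the last 20% of periods, mirroring a split by time.
td = X.index.get_level_values("td")
unique_periods = sorted(td.unique())
cutoff = unique_periods[int(len(unique_periods) * 0.8)]
# String comparison is safe here because the period labels are zero-padded.
is_valid = np.asarray(td >= cutoff)

model = ExtraTreesClassifier(n_estimators=50, max_depth=20, random_state=0)
model.fit(X[~is_valid].values, y[~is_valid].values)

train_acc = model.score(X[~is_valid].values, y[~is_valid].values)
valid_acc = model.score(X[is_valid].values, y[is_valid].values)
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {valid_acc:.3f}")
```

With random labels the model can memorize the training rows but has nothing to generalize from, so the gap between the two numbers shows how misleading a training-set score can be.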


## Make prediction

In [15]:
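# The test features are not loaded in this notebook, so this example predicts on the training data.
# Pass .values so the input matches the NumPy array the model was fitted on.
pred = model.predict(X_train.values)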
submission = pd.DataFrame(pred, columns = ['target'], index = X_train.index)
submission.head()

td,code,target
T001,A005,0
T001,A006,2
T001,A007,0
T001,A011,4
T001,A012,4


### Linear Regression Example

In [16]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train.values, Y_train.values)
# Returns the coefficient of determination R^2 of the prediction
reg.score(X_train.values, Y_train.values)

0.0032104879989707236

In [17]:
coef = pd.DataFrame(reg.coef_)
coef

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,36,37,38,39,40,41,42,43,44,45
0,9e-05,-0.003629,8e-06,6.7e-05,1e-05,-0.003642,0.006307,0.000299,0.001195,-0.001923,...,7e-06,0.002367,-0.004595,0.000119,0.010762,1.1e-05,-3.7e-05,0.002605,-1.8e-05,-4.5e-05
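The coefficient table above is indexed by position (0 to 45), which makes it hard to map back to features. A minimal sketch of labeling and ranking coefficients by feature name follows, using synthetic data; the feature names and the true weights are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data where F001 and F003 drive the target and F002 is noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["F001", "F002", "F003"])
y = 0.5 * X["F001"] - 0.1 * X["F003"] + rng.normal(scale=0.01, size=200)

# Fit on a column-vector target, as in the cell above, so coef_ has shape (1, n_features).
reg = LinearRegression().fit(X.values, y.values.reshape(-1, 1))

# Flatten the coefficients and attach the feature names.
coef = pd.Series(reg.coef_.ravel(), index=X.columns, name="coefficient")

# Sort by absolute size to see which features carry the most weight.
ranked = coef.reindex(coef.abs().sort_values(ascending=False).index)
print(ranked)
```

Raw coefficient sizes are only comparable when the features share a similar scale; with features as differently scaled as those in this dataset, standardizing them first would be needed before reading much into the ranking.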


In [18]:
# Pass .values so the input matches the NumPy array the model was fitted on
pred2 = reg.predict(X_train.values)
submission2 = pd.DataFrame(pred2, columns = ['target'], index = X_train.index)
submission2.head()

td,code,target
T001,A005,-0.002998
T001,A006,-0.003085
T001,A007,0.001564
T001,A011,0.00087
T001,A012,-0.001922


## Usage for many other models can be found in the scikit-learn documentation
### e.g. Random Forest Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html