## Creating datasets.

We will learn by doing while reading through chapter 10 of the PyTorch book. The book focuses heavily on its specific example and does not introduce the abstract components needed to construct a dataset efficiently, other than `__init__`, `__getitem__` and `__len__`. Our dataset structure will mimic the book's but differ in detail, since some of its parts are simply not important to us. The following will be updated VERY FREQUENTLY, as this is very much a learning process.

Because creating raw features can be time-consuming, we should offer the option of loading preexisting features and keep feature creation separate from the dataset. Because the raw dataframe storing the features can be unnecessarily large, we probably should not commit memory in the dataset object to storing the dataframe itself. Note that the data may need to be filtered on certain conditions, for instance on time_id and stock_id, and that there may be different kinds of data, such as simple tabular features or time-series-like features.
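
A minimal sketch of the "load if it exists, otherwise create" idea, kept outside the dataset class. The helper name `load_or_create_features` and its parameters are hypothetical and not part of this project; `create_fn` stands for whatever function actually builds the features.

```python
from pathlib import Path

import pandas as pd


def load_or_create_features(path, create_fn,
                            read_fn=pd.read_parquet,
                            write_fn=pd.DataFrame.to_parquet):
    """Return cached features from `path` if present, else build and cache them.

    Hypothetical helper: `create_fn` is any zero-argument callable returning a
    DataFrame; `read_fn` and `write_fn` default to parquet I/O.
    """
    path = Path(path)
    if path.exists():
        # Cheap path: reuse the features computed earlier.
        return read_fn(path)
    # Expensive path: compute once, then store for later runs.
    features = create_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    write_fn(features, path)
    return features
```

With a helper like this, the dataset only ever receives a finished dataframe and does not need to know how the features were computed.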

In [1]:
import torch 
from torch.utils.data import Dataset, DataLoader
import pandas as pd

We load a precalculated time-series-like feature.

In [2]:
book_time=pd.read_parquet("../processed_data/book_RV_ts_60_si.parquet")

In [7]:
len(book_time[(book_time["stock_id"]==99)&(book_time["time_id"]==9976)])

60
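
The check above looks at a single (stock_id, time_id) pair. A sketch of the same check across every pair, shown on a toy frame whose column names are assumed to match book_time; `has_complete_series` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for book_time: two (stock_id, time_id) pairs, three sub-intervals each.
toy = pd.DataFrame({
    "stock_id": [0, 0, 0, 1, 1, 1],
    "time_id": [5, 5, 5, 5, 5, 5],
    "sub_int_num": [1, 2, 3, 1, 2, 3],
    "sub_int_RV": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})


def has_complete_series(df, n_sub_int):
    """True if every (stock_id, time_id) pair has exactly n_sub_int rows."""
    counts = df.groupby(["stock_id", "time_id"]).size()
    return bool((counts == n_sub_int).all())


complete = has_complete_series(toy, n_sub_int=3)
```

For the real data one would pass `n_sub_int=60`; a pivot only yields a gap-free matrix if this holds.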

Pivot will be our best friend: it reshapes the long table into one row per row_id and one column per sub_int_num.

In [4]:
book_time_pivot=book_time.pivot(index="row_id", columns="sub_int_num", values="sub_int_RV")

In [5]:
book_time_pivot

row_id \ sub_int_num,1,2,3,4,5,6,7,8,9,10,...,51,52,53,54,55,56,57,58,59,60
0-1000,0.000341,0.000000,0.000023,0.000000,0.000170,3.818799e-07,0.000089,0.000552,0.000012,0.000000,...,0.000265,0.000000,0.000214,0.000003,0.000000,0.000118,2.313288e-04,1.060893e-05,0.000111,0.000288
0-10000,0.000290,0.000191,0.000087,0.000193,0.000241,3.154886e-04,0.000000,0.000247,0.000265,0.000000,...,0.000202,0.000375,0.000616,0.000564,0.000000,0.000023,3.777703e-08,3.777703e-08,0.000020,0.000310
0-10005,0.000000,0.000000,0.001554,0.002177,0.002303,4.375100e-04,0.000617,0.001199,0.002306,0.001215,...,0.000000,0.000486,0.000050,0.001761,0.001617,0.001801,2.552987e-03,5.364106e-04,0.000872,0.000000
0-10017,0.000142,0.000142,0.001464,0.001086,0.000068,6.771948e-05,0.000899,0.000064,0.000593,0.000451,...,0.000029,0.000000,0.001293,0.002092,0.000994,0.000848,3.104404e-03,1.224910e-03,0.001316,0.003287
0-10030,0.000327,0.000058,0.000293,0.000842,0.000120,2.586782e-04,0.000221,0.000436,0.000099,0.000008,...,0.000410,0.000437,0.000004,0.000215,0.000457,0.000183,4.842470e-04,0.000000e+00,0.000756,0.000005
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99-9972,0.000197,0.000181,0.000171,0.000172,0.000369,2.503467e-04,0.000349,0.000356,0.000390,0.000050,...,0.000075,0.000185,0.000314,0.000318,0.000115,0.000143,9.916624e-05,1.809852e-04,0.000334,0.000089
99-9973,0.000821,0.000346,0.000691,0.001591,0.000863,9.650211e-04,0.000504,0.001925,0.000641,0.000382,...,0.001081,0.001095,0.000425,0.000789,0.001295,0.000596,1.862600e-03,7.668418e-04,0.001035,0.002115
99-9976,0.000569,0.001101,0.001002,0.000430,0.000797,7.203531e-04,0.000586,0.000538,0.000570,0.000781,...,0.000508,0.000406,0.000662,0.000338,0.000710,0.000179,5.946683e-04,1.509652e-04,0.000388,0.000403
99-9988,0.000040,0.000069,0.000123,0.000056,0.000016,1.957402e-04,0.000071,0.000095,0.000063,0.000030,...,0.000034,0.000176,0.000140,0.000129,0.000175,0.000019,1.440137e-04,3.200939e-05,0.000041,0.000007
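
Once pivoted, each row is one sample, so the frame can be converted into a single tensor and the dataframe can be dropped. A small sketch on toy data, assuming the same column names as book_time; the variable names are illustrative only.

```python
import pandas as pd
import torch

# Toy long-format frame: two row_ids, three sub-intervals each.
toy = pd.DataFrame({
    "row_id": ["0-5", "0-5", "0-5", "1-5", "1-5", "1-5"],
    "sub_int_num": [1, 2, 3, 1, 2, 3],
    "sub_int_RV": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

toy_pivot = toy.pivot(index="row_id", columns="sub_int_num", values="sub_int_RV")

# Keep the row order so a position in the tensor can be mapped back to a row_id.
row_ids = toy_pivot.index.to_list()
# One row per sample, one column per sub-interval.
ts_tensor = torch.tensor(toy_pivot.to_numpy(), dtype=torch.float32)
```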


In [None]:
# The following is very much a work in progress.
class RVdataset(Dataset): 
    def __init__(self, time_id_list=None, stock_id_list=None, tab_features=None, ts_features=None, target="target", df_ts_feat=None, df_tab_feat=None, df_target=None):
        """
        :param time_id_list: List of the time_id's of interest. Defaults to None, in which case ALL time_id's are included.
        :param stock_id_list: List of the stock_id's of interest. Defaults to None, in which case ALL stock_id's are included.
        :param tab_features: List of the names of the columns in df_tab_feat to use as tabular features. Defaults to None, in which case NO tabular feature is included.
        :param ts_features: Dictionary of the form {name of the time-series feature column: name of the column indexing that time series, ...}, for instance {"sub_int_RV": "sub_int_num"} for book_time created in data_processing_functions.ipynb. Defaults to None, in which case NO time-series feature is included.
        :param target: Name of the target column in the dataframe. Defaults to "target".
        :param df_ts_feat: Dataframe containing the time-series-like features; must have "row_id" as identifier. Defaults to None.
        :param df_tab_feat: Dataframe containing the tabular features; must have "row_id" as identifier. Defaults to None.
        :param df_target: Dataframe containing the target in the target column; must have "row_id" as identifier. Defaults to None, in which case the target is looked up in df_tab_feat instead, which is then expected to contain the target column.
        """
        super().__init__()
        
    def __getitem__(self, index):
        # TODO: return the features and target for the sample at `index`.
        # return super().__getitem__(index)
        raise NotImplementedError
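
One way the skeleton above might eventually be filled in, shown as a simplified and hypothetical `ToyRVDataset` that handles a single time-series feature and a separate target frame. It is a sketch under those assumptions, not the final RVdataset: it pivots once in `__init__`, keeps only tensors (not the dataframes), and implements `__len__` and `__getitem__`.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class ToyRVDataset(Dataset):
    """Hypothetical, simplified stand-in for RVdataset (one time-series feature)."""

    def __init__(self, df_ts_feat, df_target, ts_value="sub_int_RV",
                 ts_index="sub_int_num", target="target"):
        super().__init__()
        pivot = df_ts_feat.pivot(index="row_id", columns=ts_index, values=ts_value)
        # Align the targets to the pivot's row order through row_id.
        targets = df_target.set_index("row_id")[target].reindex(pivot.index)
        # Store tensors only; the dataframes are not kept on the object.
        self.row_ids = pivot.index.to_list()
        self.features = torch.tensor(pivot.to_numpy(), dtype=torch.float32)
        self.targets = torch.tensor(targets.to_numpy(), dtype=torch.float32)

    def __len__(self):
        return len(self.row_ids)

    def __getitem__(self, index):
        return self.features[index], self.targets[index]


# Toy data: two row_ids with three sub-intervals each; targets given out of order.
df_ts = pd.DataFrame({
    "row_id": ["0-5", "0-5", "0-5", "1-5", "1-5", "1-5"],
    "sub_int_num": [1, 2, 3, 1, 2, 3],
    "sub_int_RV": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})
df_y = pd.DataFrame({"row_id": ["1-5", "0-5"], "target": [2.0, 1.0]})

dataset = ToyRVDataset(df_ts, df_y)
loader = DataLoader(dataset, batch_size=2)
x, y = next(iter(loader))
```

Filtering by time_id_list and stock_id_list, and support for tabular features, are deliberately left out of this sketch.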