## Creating datasets

We will learn by doing while reading through chapter 10 of the PyTorch book. The book focuses heavily on its specific example and does not introduce the abstract components needed to construct a dataset efficiently, beyond init, getitem and len. Our dataset structure will mimic the book's but differ in the details, since some things there are simply not important to us. The following will be updated VERY FREQUENTLY, as this is very much a learning process.

Because creating raw features can be time-consuming, we should offer the option of loading preexisting features and keep feature creation separate from the dataset. Because the raw dataframe storing the features can be unnecessarily large, we probably should not commit memory in the dataset object to storing the dataframe itself. Note that there might be conditions imposed on the data, for instance on time_id and stock_id, and that there might be different kinds of data, such as simple tabular features or time-series-like features.
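As a minimal sketch of keeping feature creation separate from the dataset, the helper below loads a cached feature table when one exists and only builds it otherwise. The names `load_or_build_features` and `build_toy_features` are hypothetical, and pickle is used here only to keep the sketch free of a parquet engine dependency; the notebook itself reads parquet files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_build_features(cache_path, build_fn):
    """Return cached features if the file exists; otherwise build and cache them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_pickle(cache_path)
    features = build_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    features.to_pickle(cache_path)
    return features


build_calls = []


def build_toy_features():
    # Hypothetical stand-in for an expensive feature computation.
    build_calls.append(1)
    return pd.DataFrame({"row_id": ["0-5", "0-11"], "rv": [0.01, 0.02]})


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "features" / "toy.pkl"
    first = load_or_build_features(cache_file, build_toy_features)   # builds, then writes the cache
    second = load_or_build_features(cache_file, build_toy_features)  # reads the cache back
```

The dataset class then only ever receives a finished dataframe, so how the features were produced stays outside of it.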

In [1]:
import torch 
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
import copy

We load a precalculated time-series-like feature.

In [2]:
book_time=pd.read_parquet("../processed_data/book_RV_ts_60_si.parquet")

In [7]:
len(book_time[(book_time["stock_id"]==99)&(book_time["time_id"]==9976)])

60
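The check above looks at a single (stock_id, time_id) pair. A sketch for checking every pair at once, on a toy frame with the same column names, could look like the following; `incomplete_groups` is a hypothetical helper and the expected length of 3 is only for the toy data.

```python
import pandas as pd


def incomplete_groups(df, expected_len, keys=("stock_id", "time_id")):
    """Return the group sizes that differ from expected_len, indexed by the group keys."""
    sizes = df.groupby(list(keys)).size()
    return sizes[sizes != expected_len]


# Toy long-format data: stock 0 has 3 sub-intervals, stock 1 only has 2.
toy = pd.DataFrame({
    "stock_id": [0, 0, 0, 1, 1],
    "time_id": [5, 5, 5, 5, 5],
    "sub_int_num": [1, 2, 3, 1, 2],
    "sub_int_RV": [0.1, 0.2, 0.3, 0.4, 0.5],
})

bad = incomplete_groups(toy, expected_len=3)
```

An empty result would mean every pair has the expected number of sub-intervals, so the pivot should produce no missing entries.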

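Pivot will be our best friend: it reshapes the long table, with one row per sub-interval, into a wide table with one row per row_id and one column per sub_int_num.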

In [4]:
book_time_pivot=book_time.pivot(index="row_id", columns="sub_int_num", values="sub_int_RV")

In [5]:
book_time_pivot

sub_int_num,1,2,3,4,5,6,7,8,9,10,...,51,52,53,54,55,56,57,58,59,60
row_id,,,,,,,,,,,,,,,,,,,,,
0-1000,0.000341,0.000000,0.000023,0.000000,0.000170,3.818799e-07,0.000089,0.000552,0.000012,0.000000,...,0.000265,0.000000,0.000214,0.000003,0.000000,0.000118,2.313288e-04,1.060893e-05,0.000111,0.000288
0-10000,0.000290,0.000191,0.000087,0.000193,0.000241,3.154886e-04,0.000000,0.000247,0.000265,0.000000,...,0.000202,0.000375,0.000616,0.000564,0.000000,0.000023,3.777703e-08,3.777703e-08,0.000020,0.000310
0-10005,0.000000,0.000000,0.001554,0.002177,0.002303,4.375100e-04,0.000617,0.001199,0.002306,0.001215,...,0.000000,0.000486,0.000050,0.001761,0.001617,0.001801,2.552987e-03,5.364106e-04,0.000872,0.000000
0-10017,0.000142,0.000142,0.001464,0.001086,0.000068,6.771948e-05,0.000899,0.000064,0.000593,0.000451,...,0.000029,0.000000,0.001293,0.002092,0.000994,0.000848,3.104404e-03,1.224910e-03,0.001316,0.003287
0-10030,0.000327,0.000058,0.000293,0.000842,0.000120,2.586782e-04,0.000221,0.000436,0.000099,0.000008,...,0.000410,0.000437,0.000004,0.000215,0.000457,0.000183,4.842470e-04,0.000000e+00,0.000756,0.000005
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99-9972,0.000197,0.000181,0.000171,0.000172,0.000369,2.503467e-04,0.000349,0.000356,0.000390,0.000050,...,0.000075,0.000185,0.000314,0.000318,0.000115,0.000143,9.916624e-05,1.809852e-04,0.000334,0.000089
99-9973,0.000821,0.000346,0.000691,0.001591,0.000863,9.650211e-04,0.000504,0.001925,0.000641,0.000382,...,0.001081,0.001095,0.000425,0.000789,0.001295,0.000596,1.862600e-03,7.668418e-04,0.001035,0.002115
99-9976,0.000569,0.001101,0.001002,0.000430,0.000797,7.203531e-04,0.000586,0.000538,0.000570,0.000781,...,0.000508,0.000406,0.000662,0.000338,0.000710,0.000179,5.946683e-04,1.509652e-04,0.000388,0.000403
99-9988,0.000040,0.000069,0.000123,0.000056,0.000016,1.957402e-04,0.000071,0.000095,0.000063,0.000030,...,0.000034,0.000176,0.000140,0.000129,0.000175,0.000019,1.440137e-04,3.200939e-05,0.000041,0.000007


The following is an example of how pivot works.

In [3]:
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', np.nan, 'q', 'w', np.nan]})

In [4]:
df

,foo,bar,baz,zoo
0,one,A,1,x
1,one,B,2,y
2,one,C,3,
3,two,A,4,q
4,two,B,5,w
5,two,C,6,


In [5]:
df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])

,baz,baz,baz,zoo,zoo,zoo
bar,A,B,C,A,B,C
foo,,,,,,
one,1,2,3,x,y,
two,4,5,6,q,w,


In [6]:
df_pv_dna=df.pivot(index='foo', columns='bar', values=['baz', 'zoo']).dropna(axis="columns")

In [7]:
df_pv_dna

,baz,baz,baz,zoo,zoo
bar,A,B,C,A,B
foo,,,,,
one,1,2,3,x,y
two,4,5,6,q,w


In [8]:
df_1=pd.DataFrame({"foo":["one","two"],"new":["peepee","poopoo"],"bar":[np.nan,np.nan]})

In [9]:
df_1=df_1.pivot(index="foo",columns="bar",values=["new"])

In [10]:
df_1

,new
bar,NaN
foo,
one,peepee
two,poopoo


In [11]:
df_merge=pd.merge(df_pv_dna,df_1,on="foo")

In [12]:
df_merge

,baz,baz,baz,zoo,zoo,new
bar,A,B,C,A,B,NaN
foo,,,,,,
one,1,2,3,x,y,peepee
two,4,5,6,q,w,poopoo


In [13]:
df_merge.loc[:,"zoo"]

bar,A,B
foo,,
one,x,y
two,q,w


In [14]:
df_merge.loc[:,"baz"].shape

(2, 3)

In [16]:
df_merge.dropna(axis="rows")

,baz,baz,baz,zoo,zoo,new
bar,A,B,C,A,B,NaN
foo,,,,,,
one,1,2,3,x,y,peepee
two,4,5,6,q,w,poopoo


In [28]:
df_merge.shape

(2, 6)

In [24]:
df_merge[["baz"]].values.astype(np.float32)

array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)

In [25]:
torch.tensor(df_merge[["baz"]].values.astype(np.float32), dtype=torch.float32)

tensor([[1., 2., 3.],
        [4., 5., 6.]])

In [None]:
#06/27/25: The following is very much a work in progress. After debugging, we will copy this into training.py.
class RVdataset(Dataset):
    def __init__(self, time_id_list=None, stock_id_list=None, tab_features=None, ts_features=None, target="target", df_ts_feat=None, df_tab_feat=None, df_target=None):
        """
        :param time_id_list: Defaults to None, in which case ALL time_id's are included. A list containing the time_id's of interest.
        :param stock_id_list: Defaults to None, in which case ALL stock_id's are included. A list containing the stock_id's of interest.
        :param tab_features: Defaults to None, in which case NO tabular feature is included. A list of names of columns in df_tab_feat to be used as tabular features; for instance, the RV of the current 10-minute bucket is a tabular feature.
        :param ts_features: Defaults to None, in which case NO time series feature is included. A list of names of columns in df_ts_feat to be used as time series features; for instance, sub_int_RV in book_time created in data_processing_functions.ipynb.
        :param target: Defaults to "target". The name of the target column in the dataframe.
        :param df_ts_feat: Defaults to None. The dataframe containing the time-series-like features; must have "row_id" as row identifier and a "sub_int_num" column giving the time series ordering.
        :param df_tab_feat: Defaults to None. The dataframe containing the tabular features; must have "row_id" as row identifier. When df_target is not None, make sure there is no target column in df_tab_feat.
        :param df_target: Defaults to None, in which case the target column is expected in df_tab_feat instead. The dataframe containing the target in the target column; must have "row_id" as row identifier.
        When time_id_list or stock_id_list is given, every dataframe passed in must also have the corresponding "time_id" or "stock_id" column.
        """
        super().__init__()
        #Treat a missing feature list as empty, and copy so the caller's lists are never modified
        ts_features=list(ts_features) if ts_features is not None else []
        tab_features=list(tab_features) if tab_features is not None else []

        def restrict(df):
            #Keep only the rows with the requested stock and time ids; None means no restriction
            mask=np.ones(len(df), dtype=bool)
            if stock_id_list is not None:
                mask&=df["stock_id"].isin(stock_id_list).to_numpy()
            if time_id_list is not None:
                mask&=df["time_id"].isin(time_id_list).to_numpy()
            return df[mask]

        #Import, add in the target, and pivot tabular features
        df_tab_copy=restrict(df_tab_feat)
        if df_target is not None:
            df_tab_copy=pd.merge(df_tab_copy,restrict(df_target),on="row_id")
        #The NaN sub_int_num gives the tabular columns the same two-level header as the time series pivot
        df_tab_copy=df_tab_copy.assign(sub_int_num=np.nan)
        #tab_features+[target] builds a new list; list.append returns None and would break the pivot
        df_tab_pv=df_tab_copy.pivot(index="row_id", columns="sub_int_num", values=tab_features+[target])
        del df_tab_copy
        #Create the full dataframe
        if ts_features:
            #Import and pivot time series features
            df_ts_pv=restrict(df_ts_feat).pivot(index="row_id", columns="sub_int_num", values=ts_features).dropna(axis="columns")
            df_whole_pv_dna=pd.merge(df_ts_pv,df_tab_pv,on="row_id").dropna(axis="rows")
            del df_ts_pv
        else:
            df_whole_pv_dna=df_tab_pv.dropna(axis="rows")
        del df_tab_pv
        #Create object values
        #The features, targets, and length
        all_feat=ts_features+tab_features
        self.features=torch.tensor(df_whole_pv_dna[all_feat].values.astype(np.float32),dtype=torch.float32)
        self.target=torch.tensor(df_whole_pv_dna[target].values.astype(np.float32),dtype=torch.float32)
        self.len=df_whole_pv_dna.shape[0]
        #The record of feature positions: how many columns of self.features each feature occupies, in order
        self.featuresplit={feat:df_whole_pv_dna[feat].shape[1] for feat in all_feat}

    def __getitem__(self, index):
        return self.features[index], self.target[index]

    def __len__(self):
        return self.len
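A sketch of how a dataset of this shape might be consumed: a DataLoader batches the flat feature tensor, and the per-feature column counts stored in featuresplit are enough to cut each batch back into named blocks with torch.split. `ToyDataset`, `split_features`, and the feature names and widths below are hypothetical stand-ins so that the example runs without the real dataframes.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Stand-in with the same attributes as the dataset above (features, target, featuresplit)."""

    def __init__(self, features, target, featuresplit):
        self.features = features
        self.target = target
        self.featuresplit = featuresplit

    def __getitem__(self, index):
        return self.features[index], self.target[index]

    def __len__(self):
        return self.features.shape[0]


def split_features(batch, featuresplit):
    """Cut a (batch, n_columns) tensor into one block per feature name."""
    blocks = torch.split(batch, list(featuresplit.values()), dim=1)
    return dict(zip(featuresplit.keys(), blocks))


# 8 samples, 5 feature columns: 3 time series columns followed by 2 tabular columns.
features = torch.arange(8 * 5, dtype=torch.float32).reshape(8, 5)
target = torch.zeros(8, 1)
toy = ToyDataset(features, target, {"sub_int_RV": 3, "tab_a": 2})

loader = DataLoader(toy, batch_size=4, shuffle=False)
x, y = next(iter(loader))
parts = split_features(x, toy.featuresplit)
```

This relies on dictionaries preserving insertion order, so the block order in featuresplit has to match the column order of the feature tensor.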
        