## Creating datasets

We will learn by doing while reading through chapter 10 of the PyTorch book. The book focuses heavily on its specific example and does not introduce the abstract components needed to construct a dataset efficiently (other than init, getitem and len). Our dataset structure will mimic the book's but differ in detail, since some things in the book are simply not important to us. The following will be updated VERY FREQUENTLY, as this is very much a learning process. 

Because creating raw features can be time-consuming, we should offer the option of loading preexisting features and keep feature creation separate from the dataset. Because the raw dataframe storing the features can be unnecessarily large, we probably should not commit memory in the dataset object to storing the dataframe itself. Note that there might be conditions imposed on the data, for instance on time_id and stock_id, and there might be different kinds of data, such as simple tabular features or time-series-like features. 
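
One way to keep feature creation separate from the dataset is a small load-or-build helper. The sketch below is only an illustration: `load_or_build_features`, `build_fn`, `build_toy_features`, and the cache path are hypothetical names that do not appear elsewhere in this notebook, and it assumes the features are small enough to be stored as a pandas pickle file.

```python
import os
import tempfile

import pandas as pd


def load_or_build_features(cache_path, build_fn):
    """Return cached features if present; otherwise build them and cache them.

    cache_path and build_fn are hypothetical names used only for this sketch.
    """
    if os.path.exists(cache_path):
        # Reuse the precomputed features instead of recomputing them.
        return pd.read_pickle(cache_path)
    df = build_fn()
    df.to_pickle(cache_path)
    return df


def build_toy_features():
    # Stand-in for an expensive feature computation.
    return pd.DataFrame({"row_id": ["0-5", "0-11"], "RV": [0.0045, 0.0012]})


cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "toy_features.pkl")
first = load_or_build_features(cache_path, build_toy_features)   # builds and caches
second = load_or_build_features(cache_path, build_toy_features)  # loads from the cache
```

The dataset class then only ever receives a ready dataframe, so it does not need to know how the features were produced.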

In [1]:
import torch 
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
import copy

Pivot will be our best friend: it reshapes long-format rows into one wide row per index value. 

The following is an example of how pivot works. 

In [27]:
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', np.nan, 'q', 'w', np.nan]})

In [28]:
df

Unnamed: 0,foo,bar,baz,zoo
0,one,A,1,x
1,one,B,2,y
2,one,C,3,
3,two,A,4,q
4,two,B,5,w
5,two,C,6,


In [29]:
df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])

Unnamed: 0_level_0,baz,baz,baz,zoo,zoo,zoo
bar,A,B,C,A,B,C
foo,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
one,1,2,3,x,y,
two,4,5,6,q,w,


In [30]:
df_pv_dna=df.pivot(index='foo', columns='bar', values=['baz', 'zoo']).dropna(axis="columns")

In [31]:
df_pv_dna

Unnamed: 0_level_0,baz,baz,baz,zoo,zoo
bar,A,B,C,A,B
foo,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
one,1,2,3,x,y
two,4,5,6,q,w


In [32]:
df_1=pd.DataFrame({"foo":["one","two"],"new":["peepee","poopoo"],"bar":[np.nan,np.nan]})

In [33]:
df_1=df_1.pivot(index="foo",columns="bar",values=["new"])

In [34]:
df_1

Unnamed: 0_level_0,new
bar,NaN
foo,Unnamed: 1_level_2
one,peepee
two,poopoo


In [35]:
df_merge=pd.merge(df_pv_dna,df_1,on="foo")

In [36]:
df_merge

Unnamed: 0_level_0,baz,baz,baz,zoo,zoo,new
bar,A,B,C,A,B,NaN
foo,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
one,1,2,3,x,y,peepee
two,4,5,6,q,w,poopoo


In [37]:
df_merge.loc[:,"zoo"]

bar,A,B
foo,Unnamed: 1_level_1,Unnamed: 2_level_1
one,x,y
two,q,w


In [38]:
df_merge.loc[:,"baz"].shape

(2, 3)

In [39]:
df_merge.dropna(axis="rows")

Unnamed: 0_level_0,baz,baz,baz,zoo,zoo,new
bar,A,B,C,A,B,NaN
foo,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
one,1,2,3,x,y,peepee
two,4,5,6,q,w,poopoo


In [28]:
df_merge.shape

(2, 6)

In [24]:
df_merge[["baz"]].values.astype(np.float32)

array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)

In [25]:
torch.tensor(df_merge[["baz"]].values.astype(np.float32), dtype=torch.float32)

tensor([[1., 2., 3.],
        [4., 5., 6.]])

In [2]:
#06/27/25: Following is very much a work in progress. After debugging, we will copy this into training.py. 
class RVdataset(Dataset): 
    def __init__(self, time_id_list=None, stock_id_list=None, tab_features=None, ts_features=None, target="target", df_ts_feat=None, df_tab_feat=None, df_target=None):
        """
        Object in subclass of Dataset. 
        
        :param time_id_list: Defaulted to None, in which case ALL time_id's will be included. A list containing the time_id's of interest. If the value is not None, it is expected that "time_id" column (with values type int) is present in all input dataframes. 
        :param stock_id_list: Defaulted to None, in which case ALL stock_id's will be included. A list containing the stock_id's of interest. If the value is not None, it is expected that "stock_id" column (with values type int) is present in all input dataframes. 
        :param tab_features: Defaulted to None, in which case NO feature will be included. A list containing the string of names of columns in df_tab_feat to be used as tabular features, for instance, the RV of current 10 mins bucket is a tabular feature. 
        :param ts_features: Defaulted to None, in which case NO feature will be included. A list containing the string of names of columns in df_ts_feat to be used as time series features, for instance, sub_int_RV in book_time created in data_processing_functions.ipynb. 
        :param target: Defaulted to "target". The string indicating how target is identified in column index of dataframe. 
        :param df_ts_feat: Defaulted to None. The dataframe containing the time series like features, must have "row_id" as identifier for rows and column "sub_int_num" as indicator of time series ordering. 
        :param df_tab_feat: Defaulted to None. The dataframe containing the tabular features, must have "row_id" as identifier for rows. When df_target is not None, one should make sure there is no target in the df_tab_feat. 
        :param df_target: Defaulted to None, in which case, target will be searched in df_tab_feat instead and expects df_tab_feat to contain target column to be used as target. The dataframe containing the target stored in the target column, must have "row_id" to be used as identifier. 
        
        Object attributes: 
        
            self.features: The collection of features as a torch tensor. 
            self.target: The collection of targets as a torch tensor. 
            self.len: The length of the whole dataset object. 
            self.featuresplit: A dictionary in form of {feature name:length of feature, ...} to help distinguish different features in the feature torch tensor. The length of feature is included since some of the features are time series, while some are tabular. 
            
        Object methods: 

            self.__init__(): Initialize the object. 
            self.__getitem__(): Returns a feature and a target both as torch tensors, in this order, of a candidate. 
            self.__len__(): Returns the length of the dataset object. 
        """
        super().__init__()
        #Restrict rows by time_id and/or stock_id; a value of None means no restriction on that id 
        def restrict(df):
            if time_id_list is not None:
                df=df[df["time_id"].isin(time_id_list)]
            if stock_id_list is not None:
                df=df[df["stock_id"].isin(stock_id_list)]
            return df
        #Import and pivot time series features 
        df_ts_pv=restrict(df_ts_feat).pivot(index="row_id", columns="sub_int_num", values=ts_features).dropna(axis="columns")
        #Import, add in the target, and pivot tabular features; copy so that the input dataframe is never modified 
        df_tab_copy=restrict(df_tab_feat).copy(deep=True)
        if df_target is not None: 
            df_tab_copy=pd.merge(df_tab_copy,restrict(df_target),on="row_id")
        #A constant NaN "sub_int_num" gives the tabular pivot the same column structure as the time series pivot 
        df_tab_copy["sub_int_num"]=np.nan 
        feat_tar=tab_features+[target]
        df_tab_pv=df_tab_copy.pivot(index="row_id", columns="sub_int_num", values=feat_tar)
        del df_tab_copy 
        #Create the full dataframe 
        df_whole_pv_dna=pd.merge(df_ts_pv,df_tab_pv,on="row_id").dropna(axis="rows")
        del df_ts_pv
        del df_tab_pv
        del feat_tar
        #Create object values 
        #The features, targets, and length
        all_feat=ts_features+tab_features
        self.features=torch.tensor(df_whole_pv_dna.loc[:,all_feat].values.astype(np.float32),dtype=torch.float32)
        self.target=torch.tensor(df_whole_pv_dna.loc[:,target].values.astype(np.float32),dtype=torch.float32)
        self.len=df_whole_pv_dna.shape[0]
        #The record of feature positions 
        all_feat_len=[df_whole_pv_dna[feat].shape[1] for feat in all_feat]
        self.featuresplit=dict(zip(all_feat,all_feat_len))
        #Clean up
        del df_whole_pv_dna
        del all_feat 
        del all_feat_len
    def __getitem__(self, index):
        # return super().__getitem__(index)
        return self.features[index], self.target[index]
    def __len__(self):
        return self.len
        

## An example 

We load a precomputed time-series-like feature. 

In [3]:
book_time=pd.read_parquet("../processed_data/book_RV_ts_60_si.parquet")

In [4]:
book_time

Unnamed: 0,time_id,sub_int_RV,sub_int_num,stock_id,row_id
0,5,0.000015,1,0,0-5
1,11,0.000004,1,0,0-11
2,16,0.000432,1,0,0-16
3,31,0.000000,1,0,0-31
4,62,0.000235,1,0,0-62
...,...,...,...,...,...
25735915,6410,0.000000,60,99,99-6410
25735916,10421,0.000000,60,99,99-10421
25735917,25639,0.000000,60,99,99-25639
25735918,25680,0.000000,60,99,99-25680


We add in a new (fake) time series feature for the sake of example. 

In [5]:
book_time["fake_ts"]=1

In [6]:
book_time

Unnamed: 0,time_id,sub_int_RV,sub_int_num,stock_id,row_id,fake_ts
0,5,0.000015,1,0,0-5,1
1,11,0.000004,1,0,0-11,1
2,16,0.000432,1,0,0-16,1
3,31,0.000000,1,0,0-31,1
4,62,0.000235,1,0,0-62,1
...,...,...,...,...,...,...
25735915,6410,0.000000,60,99,99-6410,1
25735916,10421,0.000000,60,99,99-10421,1
25735917,25639,0.000000,60,99,99-25639,1
25735918,25680,0.000000,60,99,99-25680,1


We load in precalculated tabular features. 

In [7]:
RV_train=pd.read_csv("../processed_data/RV_by_row_id.csv")

In [8]:
RV_train

Unnamed: 0,row_id,RV
0,0-5,0.004499
1,0-11,0.001204
2,0-16,0.002369
3,0-31,0.002574
4,0-62,0.001894
...,...,...
428927,99-32751,0.001436
428928,99-32753,0.001795
428929,99-32758,0.001658
428930,99-32763,0.002213


We load in targets.

In [9]:
target_train=pd.read_csv("../raw_data/kaggle_ORVP/train.csv")

In [10]:
target_train["row_id"]=target_train["stock_id"].astype(int).astype(str)+"-"+target_train["time_id"].astype(int).astype(str)

In [11]:
target_train

Unnamed: 0,stock_id,time_id,target,row_id
0,0,5,0.004136,0-5
1,0,11,0.001445,0-11
2,0,16,0.002168,0-16
3,0,31,0.002195,0-31
4,0,62,0.001747,0-62
...,...,...,...,...
428927,126,32751,0.003461,126-32751
428928,126,32753,0.003113,126-32753
428929,126,32758,0.004070,126-32758
428930,126,32763,0.003357,126-32763


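Before trying out our RVdataset class, we check by hand the merge and pivot that it applies to the tabular features and the target. 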

In [12]:
merged=pd.merge(RV_train,target_train,on="row_id")

In [13]:
merged["sub_int_num"]=np.nan

In [14]:
merged.pivot(index="row_id", columns="sub_int_num", values=["RV","target"])

Unnamed: 0_level_0,RV,target
sub_int_num,NaN,NaN
row_id,Unnamed: 1_level_2,Unnamed: 2_level_2
0-1000,0.001731,0.001348
0-10000,0.002863,0.001805
0-10005,0.008673,0.007544
0-10017,0.014300,0.011218
0-10030,0.002503,0.002854
...,...,...
99-9972,0.001629,0.001768
99-9973,0.009243,0.009511
99-9976,0.005455,0.003722
99-9988,0.001239,0.001398


The cell below takes about 6.7 s on my computer. 

In [15]:
RVds_ex=RVdataset(tab_features=["RV"],ts_features=["sub_int_RV","fake_ts"],df_ts_feat=book_time,df_tab_feat=RV_train,df_target=target_train)

Below are some example calls, starting with a call of getitem. 

In [16]:
RVds_ex.__getitem__(100)

(tensor([2.7296e-04, 6.4961e-04, 0.0000e+00, 1.6329e-03, 6.7091e-04, 1.2204e-05,
         1.1921e-03, 3.0587e-03, 2.2691e-03, 2.7618e-03, 2.8222e-05, 0.0000e+00,
         3.1516e-05, 9.5364e-04, 0.0000e+00, 1.5600e-03, 6.7393e-04, 4.9002e-04,
         3.6418e-04, 4.4144e-03, 3.4815e-04, 4.1370e-05, 3.2925e-03, 8.1821e-04,
         1.7479e-03, 2.2020e-04, 7.0674e-05, 4.8743e-03, 1.3331e-04, 6.7386e-05,
         7.0188e-05, 2.2675e-03, 1.7251e-03, 7.4227e-04, 7.2390e-04, 6.0722e-04,
         4.6948e-05, 5.7990e-04, 1.2781e-04, 4.3994e-04, 1.8528e-03, 2.9427e-04,
         5.0164e-05, 3.4550e-05, 1.5201e-03, 0.0000e+00, 0.0000e+00, 1.0583e-04,
         0.0000e+00, 9.9147e-04, 1.1185e-03, 1.0217e-03, 6.6938e-04, 1.0001e-03,
         0.0000e+00, 5.9121e-05, 4.5511e-05, 4.1519e-04, 6.3065e-04, 1.1609e-03,
         1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00,
         1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00,
         1.0000e+00, 1.0000e

A call of len. 

In [17]:
RVds_ex.len

428932

In [18]:
RVds_ex.__len__()

428932

The featuresplit attribute gives the name of each feature and its length. 

In [19]:
RVds_ex.featuresplit

{'sub_int_RV': 60, 'fake_ts': 60, 'RV': 1}
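
The lengths recorded in featuresplit can be used to cut a feature tensor back into named pieces. The following is a minimal sketch, assuming the columns of the feature tensor follow the key order of featuresplit. The shortened lengths and the `batch` tensor are made up for illustration and are not taken from the real data.

```python
import torch

# Hypothetical stand-in for RVds_ex.featuresplit, with shorter lengths for readability.
featuresplit = {"sub_int_RV": 3, "fake_ts": 3, "RV": 1}

# A made-up batch of 2 samples whose columns follow the key order of featuresplit.
batch = torch.arange(14, dtype=torch.float32).reshape(2, 7)

# torch.split cuts the last dimension into chunks of the recorded lengths.
chunks = torch.split(batch, list(featuresplit.values()), dim=-1)
named = dict(zip(featuresplit.keys(), chunks))

for name, chunk in named.items():
    print(name, tuple(chunk.shape))
```

This lets a model route the time series pieces and the tabular pieces to different layers while the dataset still stores one flat tensor per sample.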

Example with restriction on time id.

In [20]:
RV_train["time_id"]=RV_train["row_id"].apply(lambda x: int(x.split("-")[1]))

In [21]:
RV_train

Unnamed: 0,row_id,RV,time_id
0,0-5,0.004499,5
1,0-11,0.001204,11
2,0-16,0.002369,16
3,0-31,0.002574,31
4,0-62,0.001894,62
...,...,...,...
428927,99-32751,0.001436,32751
428928,99-32753,0.001795,32753
428929,99-32758,0.001658,32758
428930,99-32763,0.002213,32763


In [22]:
RVds_ex_time_5=RVdataset(time_id_list=[5],tab_features=["RV"],ts_features=["sub_int_RV","fake_ts"],df_ts_feat=book_time,df_tab_feat=RV_train,df_target=target_train)

In [23]:
RVds_ex_time_5.featuresplit

{'sub_int_RV': 60, 'fake_ts': 60, 'RV': 1}

Example with restriction on stock id. 

In [24]:
RV_train["stock_id"]=RV_train["row_id"].apply(lambda x: int(x.split("-")[0]))

In [25]:
RV_train

Unnamed: 0,row_id,RV,time_id,stock_id
0,0-5,0.004499,5,0
1,0-11,0.001204,11,0
2,0-16,0.002369,16,0
3,0-31,0.002574,31,0
4,0-62,0.001894,62,0
...,...,...,...,...
428927,99-32751,0.001436,32751,99
428928,99-32753,0.001795,32753,99
428929,99-32758,0.001658,32758,99
428930,99-32763,0.002213,32763,99


In [26]:
RVds_ex_stock_0=RVdataset(stock_id_list=[0],tab_features=["RV"],ts_features=["sub_int_RV","fake_ts"],df_ts_feat=book_time,df_tab_feat=RV_train,df_target=target_train)

In [27]:
RVds_ex_stock_0.len

3830

Example of restriction on both stock and time id. 

In [28]:
RVds_ex_99_9976=RVdataset(time_id_list=[9976],stock_id_list=[99],tab_features=["RV"],ts_features=["sub_int_RV","fake_ts"],df_ts_feat=book_time,df_tab_feat=RV_train,df_target=target_train)

In [29]:
RVds_ex_99_9976.len

1
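
Since RVdataset implements getitem and len, it can be handed to a DataLoader for batching. The sketch below uses a hypothetical `ToyDataset` with the same (features, target) contract instead of the real data, so the class name, sizes, and batch size are assumptions chosen only for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical minimal dataset returning (features, target) like RVdataset."""

    def __init__(self, n_samples=10, n_features=4):
        self.features = torch.arange(
            n_samples * n_features, dtype=torch.float32
        ).reshape(n_samples, n_features)
        self.target = torch.arange(n_samples, dtype=torch.float32).reshape(n_samples, 1)

    def __getitem__(self, index):
        return self.features[index], self.target[index]

    def __len__(self):
        return self.features.shape[0]


# The DataLoader stacks individual samples into batches along a new first dimension.
loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)
batch_shapes = [(x.shape, y.shape) for x, y in loader]
print(batch_shapes)
```

With 10 samples and a batch size of 4, the last batch holds only the 2 remaining samples, since `drop_last` defaults to False.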