# Portfolio Optimization using Deep Reinforcement Learning
---

## 6.0 Data Split
---

We split both the close-price tables and the full dataset into a training set and a test (trade) set.

The split is chronological: roughly the first 80% of the trading dates are used for training, and the remaining 20% for testing.

The full dataset is split with the `data_split` function from the FinRL library; the close-price tables are split with plain date filters.
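To make the idea of a date-based split concrete, here is a minimal sketch on made-up data. The helper `split_by_date` is hypothetical and is not FinRL code. It assumes a half-open window (`start <= date < end`), which matches our reading of `data_split` but should be checked against the installed FinRL version.

```python
import pandas as pd


def split_by_date(df, start, end, date_col="date"):
    """Hypothetical helper: keep rows with start <= date < end (end excluded)."""
    mask = (df[date_col] >= start) & (df[date_col] < end)
    return df[mask].sort_values([date_col, "tic"], ignore_index=True)


# Toy long-format data: two made-up tickers over five dates
dates = ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]
toy = pd.DataFrame({
    "date": [d for d in dates for _ in range(2)],
    "tic": ["AAA", "BBB"] * len(dates),
    "close": [float(i) for i in range(10)],
})

# 80% of five dates = four training dates; the fifth date is the test set
toy_train = split_by_date(toy, "2024-01-01", "2024-01-05")
toy_test = split_by_date(toy, "2024-01-05", "2024-01-06")
```

Because the end bound is excluded, passing the first test date as the end of the training window gives two sets that share no dates and leave no gap.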

### 6.1 Import Relevant Libraries

In [1]:
import pandas as pd
from finrl.meta.preprocessor.preprocessors import data_split

### 6.2 Load the data

In [2]:
%store

Stored variables and their in-db values:
data_df                          ->              date            tic        close     
df                               ->              date            tic        close     
df_close_full_stocks             ->             date  BRITANNIA.NS  AXISBANK.NS  TATAS
filtered_stocks                  -> Index(['NESTLEIND.NS', 'HDFCBANK.NS', 'HINDUNILVR.


In [3]:
%store -r data_df
%store -r filtered_stocks
%store -r df_close_full_stocks

In [4]:
data_df.head()

Unnamed: 0,date,tic,close,high,low,open,volume,cov_list,f01,f02,f03,f04
0,2019-01-09,ASIANPAINT.NS,1414.0,1397.150024,1402.0,1402.5,973687,"[[0.00015852269746834745, 1.720325697910041e-0...",1.47337,0.150646,0.0,4.06108
1,2019-01-09,BAJAJ-AUTO.NS,2710.0,2672.5,2696.899902,2702.5,285560,"[[0.00015852269746834745, 1.720325697910041e-0...",1.47337,0.150646,0.0,4.06108
2,2019-01-09,GRASIM.NS,840.956177,823.226685,831.991821,836.673218,4021049,"[[0.00015852269746834745, 1.720325697910041e-0...",1.47337,0.150646,0.0,4.06108
3,2019-01-09,HCLTECH.NS,474.274994,466.149994,469.200012,473.5,2471720,"[[0.00015852269746834745, 1.720325697910041e-0...",1.47337,0.150646,0.0,4.06108
4,2019-01-09,HDFCBANK.NS,1060.675049,1051.300049,1058.400024,1059.0,4284314,"[[0.00015852269746834745, 1.720325697910041e-0...",1.47337,0.150646,0.0,4.06108


In [5]:
df_close_full_stocks.head()

Unnamed: 0,date,BRITANNIA.NS,AXISBANK.NS,TATASTEEL.NS,HEROMOTOCO.NS,TATACONSUM.NS,HDFCLIFE.NS,GRASIM.NS,JSWSTEEL.NS,BAJAJ-AUTO.NS,...,BHARTIARTL.NS,NESTLEIND.NS,APOLLOHOSP.NS,BAJFINANCE.NS,INFY.NS,ONGC.NS,SUNPHARMA.NS,POWERGRID.NS,INDUSINDBK.NS,BAJAJFINSV.NS
0,2018-01-01,2387.0,569.799988,70.064255,3810.0,317.799988,399.799988,1165.166992,271.700012,3345.050049,...,484.966522,790.159973,1216.0,1760.0,522.25,195.699997,585.400024,113.484406,1655.949951,530.875
1,2018-01-02,2371.125,568.599976,69.668869,3784.350098,314.950012,398.0,1149.529175,268.0,3348.0,...,480.279999,792.304993,1212.0,1739.699951,521.0,197.5,582.0,113.512527,1647.0,522.900024
2,2018-01-03,2354.975098,565.450012,70.264328,3764.399902,315.149994,398.299988,1144.798096,272.25,3310.199951,...,473.610687,793.5,1203.949951,1738.349976,515.799988,197.399994,578.950012,113.878151,1650.0,517.97998
3,2018-01-04,2344.449951,565.0,72.736679,3759.949951,313.5,401.950012,1170.844482,281.899994,3274.25,...,475.142822,790.219971,1188.449951,1758.25,510.5,200.0,583.950012,114.103149,1652.5,513.5
4,2018-01-05,2339.75,566.0,74.018112,3758.699951,315.850006,412.0,1214.172241,289.899994,3294.0,...,488.932068,788.400024,1199.0,1821.0,513.200012,200.949997,587.349976,113.653152,1703.0,514.494995


In [6]:
# Build a wide close-price data frame: one row per date, one column per ticker

# Index by (tic, date) so each ticker's rows can be selected with .xs()
df_prices = data_df.reset_index().set_index(['tic', 'date']).sort_index()

# Collect the close series of each selected ticker as a column
df_close = pd.DataFrame()

for ticker in filtered_stocks:
    series = df_prices.xs(ticker).close
    df_close[ticker] = series
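The same long-to-wide reshaping can be sketched with `DataFrame.pivot`. The data and the `selected` list below are made up for illustration (`selected` stands in for `filtered_stocks`), and the sketch assumes each (date, ticker) pair appears only once, which `pivot` requires.

```python
import pandas as pd

# Toy long-format prices: one row per (date, ticker)
toy_long = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "tic": ["AAA", "BBB", "AAA", "BBB"],
    "close": [10.0, 20.0, 11.0, 21.0],
})
selected = ["BBB", "AAA"]  # hypothetical stand-in for filtered_stocks

# Dates become the index, tickers become columns, in the order of `selected`
toy_close = toy_long.pivot(index="date", columns="tic", values="close")[selected]
```

This avoids the explicit loop, at the cost of raising an error if the data contain duplicate (date, ticker) rows.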

In [7]:
data_df.columns

Index(['date', 'tic', 'close', 'high', 'low', 'open', 'volume', 'cov_list',
       'f01', 'f02', 'f03', 'f04'],
      dtype='object')

In [8]:
df_close.head()

date,NESTLEIND.NS,HDFCBANK.NS,HINDUNILVR.NS,TCS.NS,KOTAKBANK.NS,HCLTECH.NS,WIPRO.NS,ASIANPAINT.NS,MARUTI.NS,TITAN.NS,NTPC.NS,BAJAJ-AUTO.NS,RELIANCE.NS,POWERGRID.NS,LT.NS,ICICIBANK.NS,GRASIM.NS,ITC.NS,SBILIFE.NS,HDFCLIFE.NS
2019-01-09,1129.839966,1060.675049,1797.0,1919.0,1244.0,474.274994,249.450058,1414.0,7529.0,955.349976,123.708336,2710.0,1021.309631,111.796906,1393.550049,383.549988,840.956177,291.700012,649.75,398.450012
2019-01-10,1144.780029,1061.900024,1799.099976,1905.0,1242.0,471.5,248.287567,1400.0,7511.0,969.400024,124.625,2729.0,1015.823669,111.065651,1398.900024,382.149994,834.133301,293.700012,648.400024,409.350006
2019-01-11,1145.800049,1062.0,1795.400024,1875.0,1229.199951,470.5,247.425064,1406.199951,7400.0,969.5,125.25,2738.699951,1018.383789,110.193779,1393.900024,381.200012,829.352295,296.0,643.0,409.399994
2019-01-14,1138.569946,1054.949951,1780.75,1851.0,1224.449951,472.0,248.325058,1410.599976,7433.350098,961.349976,123.625,2747.199951,1006.223145,109.321899,1367.199951,378.700012,814.760315,294.799988,636.650024,405.049988
2019-01-15,1134.0,1063.25,1793.400024,1869.349976,1220.0,477.950012,249.225067,1414.449951,7470.0,970.700012,122.041664,2734.800049,1035.024658,111.937531,1348.0,375.399994,818.595093,297.399994,631.150024,393.700012


In [9]:
df_close = df_close.reset_index()

### 6.3 Split the Data

In [10]:
# Define the start and end dates for the train and test data

train_pct = 0.8  # fraction of the trading dates used for training
date_list = list(data_df.date.unique())  # unique dates, in the order they appear in data_df

date_list_len = len(date_list)  # number of unique dates
train_data_len = int(train_pct * date_list_len)  # number of dates before the train/test boundary

train_start_date = date_list[0]
train_end_date = date_list[train_data_len]

# The test window starts on the date right after train_end_date
test_start_date = date_list[train_data_len+1]
test_end_date = date_list[-1]
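Whether the two boundary dates end up in the training set, the test set, or neither depends on whether the later filters treat the end date as inclusive or exclusive. As a sketch of one way to get a split with no gap and no overlap, the hypothetical helper `split_dates` below slices the date list directly. The dates are made up, and the half-open convention is an assumption.

```python
def split_dates(dates, train_frac=0.8):
    """Hypothetical helper: split a sorted list of dates with no gap or overlap."""
    n_train = int(train_frac * len(dates))
    return dates[:n_train], dates[n_train:]


toy_dates = [f"2024-01-{day:02d}" for day in range(1, 11)]  # ten made-up dates
train_dates, test_dates = split_dates(toy_dates)

# With a half-open filter (start <= date < end), the first test date
# doubles as the exclusive end of the training window.
train_start, train_end_exclusive = train_dates[0], test_dates[0]
```

With an inclusive filter, the training window would instead end at `train_dates[-1]`.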

In [11]:
print('Training Data: ', 'from ', train_start_date, ' to ', train_end_date)

Training Data:  from  2019-01-09  to  2023-02-15


In [12]:
print('Testing Data: ', 'from ', test_start_date, ' to ', test_end_date)

Testing Data:  from  2023-02-16  to  2024-02-27


In [13]:
# Split the whole dataset
# (data_split keeps rows with start <= date < end, so the end date itself is excluded)
train_data = data_split(data_df, train_start_date, train_end_date)
test_data = data_split(data_df, test_start_date, test_end_date)

# Split the close prices of the selected stocks (both bounds inclusive here)
prices_train_data = df_close[df_close['date']<=train_end_date]
prices_test_data = df_close[df_close['date']>=test_start_date]

# Split the close prices of all stocks (both bounds inclusive here)
prices_full_train = df_close_full_stocks[df_close_full_stocks['date']<=train_end_date]
prices_full_test = df_close_full_stocks[df_close_full_stocks['date']>=test_start_date]

### 6.4 Store the Dataframes

In [14]:
prices_train = prices_train_data.copy()
prices_test = prices_test_data.copy()

train_df = train_data.copy()
test_df = test_data.copy()

prices_full_train_df = prices_full_train.copy()
prices_full_test_df = prices_full_test.copy()

In [15]:
train_df.shape

(20300, 12)

In [16]:
test_df.shape

(5040, 12)
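As a quick sanity check on these shapes, the row counts can be converted back into numbers of trading dates. This sketch assumes every date has exactly one row for each of the 20 tickers that appear as columns of `df_close`.

```python
n_tickers = 20                        # ticker columns in df_close
train_rows, test_rows = 20300, 5040   # row counts reported above

# One row per (date, ticker), so rows / tickers = number of dates
train_dates = train_rows // n_tickers
test_dates = test_rows // n_tickers
train_share = train_dates / (train_dates + test_dates)

print(train_dates, test_dates, round(train_share, 3))  # 1015 252 0.801
```

The training share of about 0.801 is in line with the intended 80/20 split.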

In [17]:
%store prices_train
%store prices_test

%store train_df
%store test_df

%store prices_full_train_df
%store prices_full_test_df

Stored 'prices_train' (DataFrame)
Stored 'prices_test' (DataFrame)
Stored 'train_df' (DataFrame)
Stored 'test_df' (DataFrame)
Stored 'prices_full_train_df' (DataFrame)
Stored 'prices_full_test_df' (DataFrame)
