# Portfolio Optimization using Deep Reinforcement Learning
---

## 6.0 Data Split
---

We split both the close prices and the full dataset into train and test (trade) sets.

The split is chronological: roughly the first 80% of trading dates are used for training, and the remaining 20% are held out for testing.

The full dataset is split with the data_split function from the FinRL library, and the close-price tables are split with plain date filters.
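
As a minimal sketch of what a chronological 80/20 split means for long-format data (one row per date and ticker), the example below splits on unique dates rather than on rows, so that all tickers for a given date land on the same side. The tickers AAA and BBB, the toy prices, and the variable names are made up for illustration. It uses plain pandas, not FinRL's data_split.

```python
import pandas as pd

# Toy long-format data: 10 business days x 2 hypothetical tickers.
dates = pd.bdate_range("2024-01-01", periods=10).strftime("%Y-%m-%d")
toy = pd.DataFrame(
    [(d, t, 100.0 + i) for i, d in enumerate(dates) for t in ("AAA", "BBB")],
    columns=["date", "tic", "close"],
)

# Split on the sorted list of unique dates, not on row positions.
unique_dates = sorted(toy["date"].unique())
n_train = int(0.8 * len(unique_dates))  # number of training dates

toy_train = toy[toy["date"].isin(unique_dates[:n_train])]
toy_test = toy[toy["date"].isin(unique_dates[n_train:])]

print(toy_train["date"].min(), "to", toy_train["date"].max())
print(toy_test["date"].min(), "to", toy_test["date"].max())
```

Slicing the date list with `[:n_train]` and `[n_train:]` guarantees that every date falls in exactly one of the two sets.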

### 6.1 Import Relevant Libraries

In [34]:
import pandas as pd
import numpy as np
import ta
from ta import add_all_ta_features
from ta.utils import dropna
from finrl.meta.preprocessor.preprocessors import data_split
from finrl.meta.preprocessor.preprocessors import FeatureEngineer

### 6.2 Load the data

In [35]:
%store

Stored variables and their in-db values:
data_df                          ->              date            tic        close     
df                               ->              date            tic        close     
df_close_full_stocks             ->             date   HCLTECH.NS  EICHERMOT.NS  HINDA
filtered_stocks                  -> Index(['ITC.NS', 'NTPC.NS', 'HDFCBANK.NS', 'HINDUN


In [36]:
%store -r data_df
%store -r filtered_stocks
%store -r df_close_full_stocks

In [37]:
data_df.head()

Unnamed: 0,date,tic,close,high,low,open,volume,cov_list,f01,f02,f03,f04
0,2009-01-13,ASIANPAINT.NS,91.699997,88.5,91.235001,88.5,65800,"[[0.0005821350723573744, 0.0001385649017777150...",0.646335,0.708096,0.085455,2.997396
1,2009-01-13,CIPLA.NS,189.649994,184.0,185.350006,185.0,901712,"[[0.0005821350723573744, 0.0001385649017777150...",0.646335,0.708096,0.085455,2.997396
2,2009-01-13,DRREDDY.NS,478.0,448.0,452.75,465.75,544994,"[[0.0005821350723573744, 0.0001385649017777150...",0.646335,0.708096,0.085455,2.997396
3,2009-01-13,GAIL.NS,39.375019,37.875019,38.756268,38.60627,9334277,"[[0.0005821350723573744, 0.0001385649017777150...",0.646335,0.708096,0.085455,2.997396
4,2009-01-13,GRASIM.NS,209.852203,202.908554,204.891357,205.570282,1994905,"[[0.0005821350723573744, 0.0001385649017777150...",0.646335,0.708096,0.085455,2.997396


In [38]:
df_close_full_stocks.head()

Unnamed: 0,date,HCLTECH.NS,EICHERMOT.NS,HINDALCO.NS,INDUSINDBK.NS,GRASIM.NS,AXISBANK.NS,ONGC.NS,BRITANNIA.NS,BPCL.NS,...,POWERGRID.NS,TATAMOTORS.NS,UPL.NS,BAJAJFINSV.NS,ICICIBANK.NS,DIVISLAB.NS,TCS.NS,TECHM.NS,BAJFINANCE.NS,BHARTIARTL.NS
0,2008-01-01,83.612503,41.0,197.665359,131.0,589.437805,196.690002,209.149994,149.0,88.283333,...,83.221893,146.884857,119.633331,2630.0,225.454544,482.487488,269.25,289.9375,43.721157,450.628632
1,2008-01-02,81.474998,41.0,199.115448,132.199997,590.20929,208.990005,214.833328,151.960007,88.083336,...,83.081268,152.729584,125.98333,2629.0,236.309097,480.225006,265.25,287.212494,45.858635,442.832764
2,2008-01-03,79.1875,46.200001,200.656158,131.399994,580.951111,209.929993,224.083328,152.899994,91.666664,...,85.725021,156.370575,126.616669,2600.0,229.981812,480.975006,261.25,287.0,45.664318,434.789032
3,2008-01-04,79.237503,43.5,200.203018,135.0,563.823486,214.889999,226.0,153.199997,92.800003,...,87.750023,157.922775,129.133331,2604.699951,236.363632,481.25,255.725006,286.75,49.356327,432.107788
4,2008-01-07,79.0,42.400002,198.481033,134.5,559.194397,219.399994,223.666672,153.289993,88.333336,...,86.175018,154.799194,129.666672,2599.0,250.899994,474.674988,252.199997,278.75,49.356327,427.105804


In [39]:
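# Build a wide close-price DataFrame: one row per date, one column per selected stock

# Index the data by (tic, date) so each ticker's rows can be selected directly
df_prices = data_df.reset_index().set_index(['tic', 'date']).sort_index()

# Collect the close price series of every stock in filtered_stocks
df_close = pd.DataFrame()

for ticker in filtered_stocks:
    series = df_prices.xs(ticker).close  # close prices of one ticker, indexed by date
    df_close[ticker] = series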
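
The same long-to-wide reshaping can also be written with a pivot. The sketch below is an alternative under the assumption that each (date, tic) pair occurs only once. The toy frame, the tickers AAA and BBB, and the names `selected` and `wide` are hypothetical.

```python
import pandas as pd

# Toy long-format data with hypothetical tickers.
toy = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "tic": ["AAA", "BBB", "AAA", "BBB"],
    "close": [10.0, 20.0, 11.0, 21.0],
})
selected = ["BBB", "AAA"]  # plays the role of filtered_stocks

# One row per date, one column per ticker, in the order given by `selected`.
# pivot raises an error if a (date, tic) pair is duplicated.
wide = toy.pivot(index="date", columns="tic", values="close")[selected]
wide.columns.name = None  # drop the leftover "tic" label on the column axis

# Turn the date index back into an ordinary column.
wide_flat = wide.reset_index()
print(wide_flat)
```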

In [40]:
data_df.columns

Index(['date', 'tic', 'close', 'high', 'low', 'open', 'volume', 'cov_list',
       'f01', 'f02', 'f03', 'f04'],
      dtype='object')

In [41]:
df_close.head()

date,ITC.NS,NTPC.NS,HDFCBANK.NS,HINDUNILVR.NS,CIPLA.NS,GRASIM.NS,LT.NS,ASIANPAINT.NS,MARUTI.NS,RELIANCE.NS,POWERGRID.NS,SUNPHARMA.NS,WIPRO.NS,TCS.NS,DRREDDY.NS,INFY.NS,GAIL.NS,SBIN.NS,ICICIBANK.NS,HEROMOTOCO.NS
2009-01-13,57.483334,142.875,101.199997,263.850006,189.649994,209.852203,317.688873,91.699997,595.0,261.956329,42.721886,114.885002,55.068752,134.25,478.0,155.300003,39.375019,118.629997,81.527275,814.0
2009-01-14,57.833332,148.25,104.470001,257.0,189.5,209.852203,319.733337,93.800003,604.0,278.414307,42.41251,114.800003,57.116253,139.399994,487.0,164.25,40.818771,121.144997,82.0,838.0
2009-01-15,56.599998,141.75,96.0,252.199997,187.5,205.994614,319.977783,93.790001,598.200012,265.14505,41.709385,112.489998,53.955002,130.75,480.0,157.987503,38.025021,117.5,78.0,840.0
2009-01-16,57.566666,150.833328,94.455002,252.5,184.850006,203.37146,324.399994,94.400002,597.900024,281.385895,43.003136,113.394997,54.675003,129.75,454.899994,159.237503,39.375019,117.660004,77.763634,849.900024
2009-01-19,58.283333,151.583328,95.449997,254.949997,185.0,200.563156,327.333344,93.574997,585.099976,285.717529,44.38126,113.894997,54.112503,129.475006,461.700012,159.09375,40.425018,118.300003,80.800003,846.0


In [42]:
df_close = df_close.reset_index()

### 6.3 Split the Data

In [43]:
# Define the start and end dates for the train and test data

train_pct = 0.8 # fraction of trading dates used for training
date_list = list(data_df.date.unique()) # unique trading dates (assumes data_df is sorted by date)

date_list_len = len(date_list) # number of trading dates
train_data_len = int(train_pct * date_list_len) # position of the train/test boundary in date_list

train_start_date = date_list[0]
train_end_date = date_list[train_data_len]

test_start_date = date_list[train_data_len+1] # first date after the train end date
test_end_date = date_list[-1]
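
Which rows end up in each split depends on whether the date filter treats the end date as inclusive. The sketch below applies the same boundary indices to a made-up 20-day calendar and compares inclusive filters with half-open ones. The calendar and all variable names are hypothetical, and neither filter is claimed to be what data_split does.

```python
# Hypothetical calendar of 20 consecutive dates as ISO strings.
dates = [f"2024-01-{d:02d}" for d in range(1, 21)]

k = int(0.8 * len(dates))            # boundary position, as in the cell above
train_start, train_end = dates[0], dates[k]
test_start, test_end = dates[k + 1], dates[-1]

# Inclusive end dates: start <= d <= end
train_incl = [d for d in dates if train_start <= d <= train_end]
test_incl = [d for d in dates if test_start <= d <= test_end]

# Half-open intervals: start <= d < end
train_half = [d for d in dates if train_start <= d < train_end]
test_half = [d for d in dates if test_start <= d < test_end]

# Dates that fall into neither half-open split.
dropped = [d for d in dates if d not in train_half + test_half]

print(len(train_incl), len(test_incl))
print(len(train_half), len(test_half), dropped)
```

On this toy calendar the inclusive filters cover every date, while the half-open filters skip the boundary date and the final date. It is therefore worth checking which convention the installed version of data_split follows, for example by comparing the resulting shapes with the number of unique dates.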

In [44]:
print('Training Data: ', 'from ', train_start_date, ' to ', train_end_date)

Training Data:  from  2009-01-13  to  2021-02-22


In [45]:
print('Testing Data: ', 'from ', test_start_date, ' to ', test_end_date)

Testing Data:  from  2021-02-23  to  2024-02-28


In [46]:
# Split the whole dataset
train_data = data_split(data_df, train_start_date, train_end_date)
test_data = data_split(data_df, test_start_date, test_end_date)

# Split the Close Prices dataset
prices_train_data = df_close[df_close['date']<=train_end_date]
prices_test_data = df_close[df_close['date']>=test_start_date]

# Split the close prices of all stocks
prices_full_train = df_close_full_stocks[df_close_full_stocks['date']<=train_end_date]
prices_full_test = df_close_full_stocks[df_close_full_stocks['date']>=test_start_date]
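
After splitting, a quick sanity check can confirm that the two sets do not share dates and that the test period starts after the training period ends. The helper `check_split` below is a hypothetical illustration, not part of FinRL. It assumes a `date` column of ISO-formatted strings (or any sortable date type) and is shown on toy data.

```python
import pandas as pd

def check_split(train, test, date_col="date"):
    """Hypothetical helper: summarize how two date-based splits relate."""
    train_dates = set(train[date_col])
    test_dates = set(test[date_col])
    return {
        # number of dates present in both splits (should be 0)
        "overlap": len(train_dates & test_dates),
        # True if every training date precedes every test date
        "chronological": max(train_dates) < min(test_dates),
        # share of unique dates that ended up in the training split
        "train_fraction": len(train_dates) / (len(train_dates) + len(test_dates)),
    }

# Toy data: 5 dates x 2 hypothetical tickers.
toy_dates = [f"2024-01-0{d}" for d in range(1, 6)]
toy = pd.DataFrame(
    [(d, t, 1.0) for d in toy_dates for t in ("AAA", "BBB")],
    columns=["date", "tic", "close"],
)
toy_train = toy[toy["date"] <= "2024-01-04"]
toy_test = toy[toy["date"] >= "2024-01-05"]

report = check_split(toy_train, toy_test)
print(report)
```

The same call could be applied to each train/test pair produced above, provided both frames expose the date as a column.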

### 6.4 Store the Dataframes

In [47]:
prices_train = prices_train_data.copy()
prices_test = prices_test_data.copy()

train_df = train_data.copy()
test_df = test_data.copy()

prices_full_train_df = prices_full_train.copy()
prices_full_test_df = prices_full_test.copy()

In [48]:
train_df.shape

(59660, 12)

In [49]:
test_df.shape

(14900, 12)

In [50]:
%store prices_train
%store prices_test

%store train_df
%store test_df

%store prices_full_train_df
%store prices_full_test_df

Stored 'prices_train' (DataFrame)
Stored 'prices_test' (DataFrame)
Stored 'train_df' (DataFrame)
Stored 'test_df' (DataFrame)
Stored 'prices_full_train_df' (DataFrame)
Stored 'prices_full_test_df' (DataFrame)
