# Trading Strategy for Finance using LSTMs

### Notebook Configurations and Packages

Run the first cell below to display information about the GPUs available on the server. The next cell imports several widely used modules: NumPy for numerical calculations, pandas for data management, matplotlib for visualizations, and TensorFlow for building and training deep neural networks.

**Environment Verification**

In [6]:
!nvidia-smi

Mon Apr  8 10:39:59 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Quadro P5000        Off  | 00000000:03:00.0 Off |                  Off |
| 28%   42C    P0    42W / 180W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P5000        Off  | 00000000:04:00.0 Off |                  Off |
| 35%   49C    P0    38W / 180W |      0MiB / 16278MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
                                                                            

In [7]:
#imports
import h5py
import pandas as pd 
import numpy as np
import pprint as pp 
import tensorflow as tf 
from tensorflow.contrib import rnn
import math
import sklearn
import matplotlib.pyplot as plt
import warnings
from tradingcore import prepareData as prepData
from numpy.random import seed

In [8]:
seed(42)
tf.set_random_seed(42)
MAX_SEQUENCE_LENGTH = 32
EMBEDDING_DIM = 300

# Keras is accessed through tf.keras, since the standalone keras package is not imported above
print("Keras version:", tf.keras.__version__)
print("Tensorflow version:", tf.__version__)
print("Sklearn version:", sklearn.__version__)

### Data Preparation

A typical deep learning (DL) workflow starts with data preparation, because raw data is rarely clean and ready to use. Building and training the deep neural network follows. Lastly, the trained network is validated on a test dataset.

The original data needs to be cleaned before training the network. Since cleaning the data takes a significant amount of time (around 20 minutes), we have stored the cleaned data in a separate .h5 file. If you would like to use the original data and run the cleaning code, please set the `usePreparedData` variable to `False`.

In [None]:
# The data is prepared and stored in a separate .h5 file.
# Set usePreparedData = False to use the original data and run the data preparation code
usePreparedData = True
# insampleCutoffTimestamp is used to split the data in time into two pieces: the training set and the test set.
insampleCutoffTimestamp = 1650

# If usePreparedData is True, the prepared data is loaded. Otherwise, the original data is loaded.
if usePreparedData:
    with pd.HDFStore("data/algo_trading/trainDataPrepared.h5", 'r') as train:
        df = train.get("train")
else:
    with pd.HDFStore("data/algo_trading/train.h5", 'r') as train:
        df = train.get("train")
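
As a minimal sketch of how a cutoff such as `insampleCutoffTimestamp` can split data in time, the example below builds a small toy frame and separates it into in-sample and out-of-sample rows. The `split_by_timestamp` helper, the `feature` and `y` columns, and the choice to place rows at the cutoff itself in the test set are assumptions for illustration; the split used later in the notebook may differ.

```python
import pandas as pd

# Toy frame with 'id' and 'timestamp' columns; 'feature' and 'y' are
# hypothetical stand-ins for the real columns.
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "timestamp": [1648, 1650, 1652, 1649, 1651, 1653],
    "feature": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "y": [0.01, -0.02, 0.03, 0.00, 0.01, -0.01],
})


def split_by_timestamp(frame, cutoff):
    """Rows before the cutoff form the training set; the rest form the test set.

    Assumption: a row whose timestamp equals the cutoff goes to the test set.
    """
    in_sample = frame[frame["timestamp"] < cutoff]
    out_of_sample = frame[frame["timestamp"] >= cutoff]
    return in_sample, out_of_sample


train_df, test_df = split_by_timestamp(toy, cutoff=1650)
print(len(train_df), len(test_df))
```

Splitting on time rather than at random keeps every training row earlier than every test row, so the model is never evaluated on data that precedes what it was trained on.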

There are multiple instruments in the dataset and each instrument has an id. Time is represented by the 'timestamp' feature. Let's look at the data.

In [None]:
# This will print the dataset
df
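
Beyond printing the full frame, one way to get a quick overview of a multi-instrument dataset is to summarize the time coverage per instrument. The sketch below uses a toy frame with hypothetical values; only the `id` and `timestamp` column names come from the description above.

```python
import pandas as pd

# Toy data: two instruments observed over a few timestamps (values are made up).
toy = pd.DataFrame({
    "id": [10, 10, 10, 20, 20],
    "timestamp": [0, 1, 2, 1, 2],
    "feature": [0.5, 0.7, 0.6, 1.2, 1.1],
})

# Number of distinct instruments in the frame.
n_instruments = toy["id"].nunique()

# First timestamp, last timestamp, and row count for each instrument.
coverage = toy.groupby("id")["timestamp"].agg(["min", "max", "count"])

print(n_instruments)
print(coverage)
```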

If the original data is loaded, the data preparation code in the following cell is executed. First, extreme values in each feature are removed. Then, some hand-crafted features are added to the feature set to boost prediction accuracy. Many other methods, including PCA and auto-encoders, can be used for feature engineering instead of hand-crafted features. As an exercise, we highly recommend adding auto-encoders to the code after the lab and checking the accuracy. Lastly, NaNs are replaced with the median of the feature.

In [None]:
if not usePreparedData:
    # The original data is not clean and some of the samples are a bit extreme.
    # These values are removed from the feature set.
    df = prepData.removeExtremeValues(df, insampleCutoffTimestamp)
    # A little bit of feature engineering. Hand-crafted features are created here to boost the accuracy.
    df = prepData.createNewFeatures(df)
    # Replace the remaining NaNs with the median of each feature
    df = prepData.fillNaNs(df)
    # Store the prepared data so the cleaning step can be skipped next time
    df.to_hdf("data/algo_trading/trainDataPrepared.h5", key='train')
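
The implementation of `tradingcore.prepareData` is not shown here, so the following is only a rough sketch of what the first and last cleaning steps could look like, not the lab's actual code. The helpers `mask_extremes` and `fill_with_in_sample_median`, the quantile thresholds, and the toy column `f` are all hypothetical. The sketch also assumes that bounds and medians are computed from in-sample rows only, so that no information from the test period leaks into the cleaning.

```python
import numpy as np
import pandas as pd


def mask_extremes(frame, cols, cutoff, lower=0.1, upper=0.9):
    """Replace values outside in-sample quantile bounds with NaN (hypothetical helper)."""
    out = frame.copy()
    in_sample = out[out["timestamp"] < cutoff]
    for col in cols:
        bounds = in_sample[col].quantile([lower, upper])
        lo, hi = bounds.iloc[0], bounds.iloc[1]
        # Keep values inside [lo, hi]; everything else becomes NaN.
        out[col] = out[col].where(out[col].between(lo, hi))
    return out


def fill_with_in_sample_median(frame, cols, cutoff):
    """Fill NaNs with each column's median over in-sample rows (hypothetical helper)."""
    out = frame.copy()
    medians = out.loc[out["timestamp"] < cutoff, cols].median()
    out[cols] = out[cols].fillna(medians)
    return out


# Toy data: two extreme in-sample values and one missing out-of-sample value.
toy = pd.DataFrame({
    "timestamp": list(range(10)),
    "f": [-500.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 1000.0, 4.0, np.nan],
})
cutoff = 8

masked = mask_extremes(toy, ["f"], cutoff)
cleaned = fill_with_in_sample_median(masked, ["f"], cutoff)
print(cleaned["f"].tolist())
```

Masking extremes as NaN first and filling afterwards means the outliers do not distort the median that replaces them.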
