# Machine Trading

The purpose is to create a super-class that combines scikit-learn + statsmodels in order to get automatically a wide variety of models without the need of resorting to the original, sometimes complicated, commands. 

---
## Contents

1. [The Basics of Algorithmic Trading](#Basics)
2. [Factor Models](#FM)
3. [Time-Series Analysis](#TSAnalysis)
4. [Artificial Intelligence Techniques](#AI)

 4.1 [Classes and Functions](#ClassCh4)
 
 4.2 [10-fold CV OLS](#CVOLS)
 
 4.3 [SVM](#SVM)
 
 4.4 [Neural Network](#NN)

5. [Options Strategies](#Options)
6. [Intraday Trading and Market Microstructure](#IDT)
7. [Bitcoins](#BTC)
8. [Algorithmic Trading Is Good for Body and Soul](#Conclusion)

---
## Libraries

In [14]:
import numpy as np
import pandas as pd
from sklearn import datasets

import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.linear_model import LinearRegression

from sklearn.model_selection import TimeSeriesSplit
from sklearn import metrics

from sklearn import svm

import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

# Test dataset
iris = datasets.load_iris()

---
## Capter 4 - Artificial Intelligence Techniques <a id="AI"></a>

The following models will be used

* CV - OLS: Pending work to include summary on the class
* FUNCTION: An all-around function for various classification + regression models
* SVM: Pending work to add SVM into a super-classification class tailored to our needs
* NN: Tensorflow Simple model, pending to create a class tailored to our needs

### Chapter Classes & Functions <a id="ClassCh4"></a>

* CLASS: Statsmodels OLS + Scikit-learn CV:

In [45]:
#!!!! NEED TO INCLUDE SUMMARY INSIDE THE CLASS IN ORDER TO MAKE IT COMPLETE
class LinModels(BaseEstimator, RegressorMixin):
    """ 
    Create a Scikit-learn wrapper in order to apply 
    cross-validation methods on the statsmodels.
    
    The reason is that statsmodels offer R-style results,
    whereas Scikit-learn does not. 
    
    Parameters
    ----------
    X: Independent variables
    y: Dependent variable
    
    """
    def __init__(self, model_class, fit_intercept=True):
        self.model_class = model_class
        self.fit_intercept = fit_intercept
        
    def fit(self, X, y):
        if self.fit_intercept:
            X = sm.add_constant(X)
        self.model_ = self.model_class(y, X)
        self.results_ = self.model_.fit()
        
    def predict(self, X):
        if self.fit_intercept:
            X = sm.add_constant(X)
        return self.results_.predict(X)

* FUNCTION: The below function has the purpose of being an all-around algorithm for classification/clustering/regression. 

In [37]:
def CrossValidation(model_type, X_data, y_data, scoring=None, folds=10, cv_type="standard", stratify=None, 
                    shuffle=False, test_size=0.3, random_state=1234):
    """
    A flexible all-around cross validation function that can be applied 
    across a variety of models
    ---
    
    Imports needed: 
    * from sklearn.model_selection import TimeSeriesSplit
    * from sklearn.model_selection import train_test_split
    * from sklearn import metrics
    ---
    
    Inputs:
    * model_type:   The type of the model (cluster, regression, classification)
    * X_data:       Input, independent variables
    * y_data:       The responce variable
    * scoring:      The estimator score method
    * folds:        The number of k-folds for the CV process
    * cv_type:      Random CV selection ("standard") or time-series selection ("time_series")
    * stratify:     For class innequality in the train/test split
    * shuffle:      Shuffle the data before the train/test split
    * test_size:    The % size of the test data
    * random_state: A random key for replicating the split
    """
    # Split the dataset first
    if shuffle==False:
        print("If 'shuffle==False' then 'stratify=None'")
        stratify=None
        
    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=test_size, 
                                                        random_state=random_state, stratify=stratify, 
                                                       shuffle=shuffle)
    print("The train/test dataset sizes are:")
    print(X_train.shape, y_train.shape)
    print(X_test.shape, y_test.shape)
    print("")
    
    # Check the type of the model in order to use the appropriate scores
    if is_classifier(model_type):
        if any(scoring in s for s in ['accuracy', 'balanced_accuracy', 'f1', 
                                      'precision', 'roc_auc', 'recall', 'f1_micro', 
                                     'f1_macro', 'f1_weighted']):
            scoring = scoring
        else:
            print("The scoring parameter turned to default for classes, 'precision'")
            scoring = 'precision'
            
    elif is_regressor(model_type):
        if any(scoring in s for s in ['explained_variance', 'max_error', 'neg_mean_absolute_error', 
                         'neg_mean_squared_error', 'neg_mean_squared_log_error', 'r2']):
            scoring = scoring
        else:
            print("The scoring parameter turned to default for regressors, 'r2'")
            scoring = 'r2'
        
       
    # Perform Cross Validation
    if cv_type=="time_series":
        cv = TimeSeriesSplit(n_splits=folds).split(y_train)
        score = cross_val_score(model_type, 
                         X_train, y_train, cv=cv, scoring=scoring)
        predictions = cross_val_predict(model_type, X_test, y_test, cv=cv)
        print("Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))
        
    elif cv_type=="standard":
        score = cross_val_score(model_type, 
                         X_train, y_train, cv=folds, scoring=scoring)
        predictions = cross_val_predict(model_type, X_test, y_test, cv=folds)
        print("Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))
            
    return(score, predictions)
            

Run an example of the above function on various models. Checks where it fails

In [35]:
import pandas as pd
from sklearn import datasets, linear_model, ensemble, metrics
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
from sklearn.base import is_classifier, is_regressor
from matplotlib import pyplot as plt

In [38]:
lm = linear_model.LinearRegression()
rfc = ensemble.RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
rfr = ensemble.RandomForestRegressor()

CrossValidation(model_type=lm, X_data=iris.data, y_data=iris.target, scoring='precision')
CrossValidation(model_type=rfc, X_data=iris.data, y_data=iris.target, scoring='f1_macro')
CrossValidation(model_type=rfr, X_data=iris.data, y_data=iris.target, scoring='r2')

If 'shuffle==False' then 'stratify=None'
The train/test dataset sizes are:
(105, 4) (105,)
(45, 4) (45,)

The scoring parameter turned to default for regressors, 'r2'
Accuracy: 0.12 (+/- 0.59)
If 'shuffle==False' then 'stratify=None'
The train/test dataset sizes are:
(105, 4) (105,)
(45, 4) (45,)



  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


Accuracy: 0.93 (+/- 0.29)
If 'shuffle==False' then 'stratify=None'
The train/test dataset sizes are:
(105, 4) (105,)
(45, 4) (45,)

Accuracy: 0.60 (+/- 1.33)




(array([ 1.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  1., -1.]),
 array([2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]))

Create a class when the above function is finilised

In [None]:
class FOXModels():
    """
    Work in progress
    """
    def __init__(self, )

* NN Tensorflow

In [91]:
class IrisClassifier(Model):
    """ 
    Create an Iris neural network classifier
    using Tensorflow 2.0 and Keras
    
    Parameters
    ----------
    X: Independent variables
    y: Dependent variable
    
    """
    def __init__(self):
        super(IrisClassifier, self).__init__()
        self.layer1 = Dense(10, activation='relu')
        self.layer2 = Dense(10, activation='relu')
        self.outputLayer = Dense(3, activation='softmax')
    
    def call(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        return self.outputLayer(x)

### 10-fold CV Linear Regression <a id="CVOLS"></a>

A simple example of using 10-fold CV on statsmodels regression

In [54]:
scores = cross_val_score(LinModels(sm.OLS), iris.data, iris.target, scoring='r2', cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.16 (+/- 0.64)


In [57]:
mod = sm.OLS(iris.target, iris.data)
res = mod.fit()
print(res.summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.972
Model:                            OLS   Adj. R-squared (uncentered):              0.971
Method:                 Least Squares   F-statistic:                              1267.
Date:                Sun, 03 Nov 2019   Prob (F-statistic):                   3.17e-112
Time:                        23:14:31   Log-Likelihood:                          17.009
No. Observations:                 150   AIC:                                     -26.02
Df Residuals:                     146   BIC:                                     -13.98
Df Model:                           4                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

### SVM <a id="SVM"></a>

A simple 10-fold CV SVM model using the iris dataset

In [58]:
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.97 (+/- 0.09)


### Neural Network <a id="NN"></a>

In [74]:
## split data set
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.33, 
                                                    random_state=42, stratify= iris.target)
 
## max min scalar on parameters
X_scaler = MinMaxScaler(feature_range=(0,1))
 
## Preprocessing the dataset
X_train_scaled = X_scaler.fit_transform(X_train)
X_test_scaled = X_scaler.fit_transform(X_test)
 
## One hot encode Y
onehot_encoder = OneHotEncoder(sparse=False)
Y_train_enc = onehot_encoder.fit_transform(Y_train.reshape(-1,1))
Y_test_enc = onehot_encoder.fit_transform(Y_test.reshape(-1,1))

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


In [92]:
model = IrisClassifier()

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

In [93]:
model.fit(X_train_scaled, Y_train_enc, epochs=300, batch_size=10)

Train on 100 samples
Epoch 1/300
Epoch 2/300
Epoch 3/300
Epoch 4/300
Epoch 5/300
Epoch 6/300
Epoch 7/300
Epoch 8/300
Epoch 9/300
Epoch 10/300
Epoch 11/300
Epoch 12/300
Epoch 13/300
Epoch 14/300
Epoch 15/300
Epoch 16/300
Epoch 17/300
Epoch 18/300
Epoch 19/300
Epoch 20/300
Epoch 21/300
Epoch 22/300
Epoch 23/300
Epoch 24/300
Epoch 25/300
Epoch 26/300
Epoch 27/300
Epoch 28/300
Epoch 29/300
Epoch 30/300
Epoch 31/300
Epoch 32/300
Epoch 33/300
Epoch 34/300
Epoch 35/300
Epoch 36/300
Epoch 37/300
Epoch 38/300
Epoch 39/300
Epoch 40/300
Epoch 41/300
Epoch 42/300
Epoch 43/300
Epoch 44/300
Epoch 45/300
Epoch 46/300
Epoch 47/300
Epoch 48/300
Epoch 49/300
Epoch 50/300
Epoch 51/300
Epoch 52/300
Epoch 53/300
Epoch 54/300
Epoch 55/300
Epoch 56/300
Epoch 57/300
Epoch 58/300
Epoch 59/300
Epoch 60/300
Epoch 61/300
Epoch 62/300
Epoch 63/300
Epoch 64/300
Epoch 65/300
Epoch 66/300
Epoch 67/300
Epoch 68/300
Epoch 69/300
Epoch 70/300
Epoch 71/300
Epoch 72/300
Epoch 73/300
Epoch 74/300
Epoch 75/300
Epoch 76/300


<tensorflow.python.keras.callbacks.History at 0x1c50856c50>

In [85]:
scores = model.evaluate(X_test_scaled, Y_test_enc)
print("\nAccuracy: %.2f%%" % (scores[1]*100))


Accuracy: 98.00%


In [88]:
prediction = model.predict(X_test_scaled)
prediction1 = pd.DataFrame({'IRIS1':prediction[:,0],'IRIS2':prediction[:,1], 'IRIS3':prediction[:,2]})
prediction1.round(decimals=4).head()

Unnamed: 0,IRIS1,IRIS2,IRIS3
0,0.0,0.4839,0.5161
1,0.0,0.9491,0.0508
2,0.9998,0.0002,0.0
3,0.0002,0.9992,0.0006
4,0.0,0.0015,0.9985
