# Introduction
This notebook builds a machine learning model that takes an asset's daily high, low, close, open interest, and volume data, trains on it, and predicts whether future realized volatility will be higher or lower than a user-defined level.

Preprocessing functions clean the data and create data frames that look ahead over a forward period from each point in time and determine whether a user-defined volatility threshold has been exceeded.

This allows a machine learning algorithm to train on the data set and make predictions.

Within this notebook, a Decision Tree classifier trains on a financial asset's daily market information. A function lets the user set a range of volatility levels to test after training. For each level, it then makes a prediction based on the most recent data in the data set.

# Data
The data used here is from BarChart.com. I download 'daily nearby' futures data. If the file has a Symbol column, that column is eventually removed within the processing function. For processing to work, the first line of the file must contain the column names, and all null values need to be filled in manually.
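The loading requirements described above could be sketched as follows. This is a minimal illustration and not the notebook's own preprocessing: the helper name `load_clean` is hypothetical, forward-filling is only one assumed way to replace the manual null filling, and the two sample rows are taken from the data preview with one Volume value blanked out for the example.

```python
import io
import pandas as pd

def load_clean(source):
    """Hypothetical helper: read a daily CSV, drop 'Symbol' if present, fill gaps."""
    df = pd.read_csv(source)                           # first line must hold the column names
    df = df.drop(columns=['Symbol'], errors='ignore')  # no error if the column is absent
    df = df.ffill()                                    # assumption: carry the last known value forward
    return df

# Illustrative input only: the Volume value in the second row is deliberately missing.
sample_csv = io.StringIO(
    "Date Time,Symbol,Open,High,Low,Close,Change,Volume,Open Interest\n"
    "1/3/2012,E6H12,1.3047,1.3085,1.3020,1.3063,0.0,137219,282790\n"
    "1/4/2012,E6H12,1.3059,1.3063,1.2904,1.2944,-0.0119,,293235\n"
)
clean = load_clean(sample_csv)
```

Whether forward-filling is appropriate depends on which column has the gap, so filling values by hand, as the notebook does, may still be the safer choice.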

In [1]:
import pandas as pd
import numpy as np

In [2]:
from IPython.display import display # Allows the use of display() for DataFrames

# Pretty display for notebooks
%matplotlib inline

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

In [5]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [6]:
from sklearn.metrics import precision_recall_fscore_support

In [7]:
from sklearn.preprocessing import StandardScaler

In [8]:
file = r"C:\Users\Matt\Desktop\data\subject_data.csv"

In [9]:
og = pd.read_csv(file)

In [10]:
len(og)

2049

In [11]:
def vol_convert(vol):
    rate = vol/1600
    return rate
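The divisor 1600 is not explained in the notebook. One plausible reading, given that `tail_vol` later multiplies a rate by `100*16`, is that it combines a percent-to-fraction conversion (100) with an annual-to-daily scaling of about 16 (close to the square root of the number of trading days in a year). Under that assumption, the round trip looks like this; `rate_to_vol` is a hypothetical name used only for illustration.

```python
def vol_convert(vol):
    # Annualized volatility in percent -> average daily move as a fraction
    return vol / 1600

def rate_to_vol(rate):
    # Hypothetical inverse, using the same arithmetic as tail_vol: fraction -> annualized percent
    return rate * 100 * 16

daily = vol_convert(8)            # an 8% annualized level
annual = rate_to_vol(daily)       # converts back to the same 8
```

So a threshold of 8 corresponds to an average daily move of 0.005, which is the scale of the `forward_avg_close_max` values shown further down.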

In [12]:
start_vol = 25
end_vol = 40
step = .5
hist_period1 = 10
hist_period2 = 20
hist_period3 = 30
volume_period1 = 2
volume_period2 = 4
forward_vol_period = 20
vol_to_test = 6
rate = vol_convert(vol_to_test)

In [13]:
og.head()

Unnamed: 0,Date Time,Symbol,Open,High,Low,Close,Change,Volume,Open Interest
0,1/3/2012,E6H12,1.3047,1.3085,1.302,1.3063,0.0,137219,282790
1,1/4/2012,E6H12,1.3059,1.3063,1.2904,1.2944,-0.0119,216177,293235
2,1/5/2012,E6H12,1.2948,1.2952,1.2777,1.2792,-0.0152,266566,298476
3,1/6/2012,E6H12,1.2798,1.282,1.2703,1.2729,-0.0063,227315,293406
4,1/9/2012,E6H12,1.27,1.2791,1.2673,1.2761,0.0032,187019,291209


In [14]:
data = og.copy()

In [15]:
# Prepare the dataset for machine learning: add custom columns. The forward volatility is the rolling mean of the average of the max move and the absolute close-to-close change.

def process(dataframe, hist1, hist2, hist3, volume1, volume2, forward_vol_period, question_vol, cushion):
    
    
    dataframe['abs_change'] = dataframe['Change'].abs() / dataframe['Close']
    
    dataframe['high_move'] = (((dataframe.High - dataframe.Close.shift(1))/dataframe['Close']).abs())
    dataframe['low_move'] = (((dataframe.Low - dataframe.Close.shift(1))/dataframe['Close']).abs())  # use the function argument, not the global `data`
    dataframe['max_move'] = dataframe[['high_move', 'low_move']].max(axis=1)
    
    dataframe['hist_max_1'] = dataframe.max_move.rolling(window=hist1).mean()
    dataframe['hist_max_2'] = dataframe.max_move.rolling(window=hist2).mean()
    dataframe['hist_max_3'] = dataframe.max_move.rolling(window=hist3).mean()
    
    dataframe['hist_change_1'] = dataframe.abs_change.rolling(window=hist1).mean()
    dataframe['hist_change_2'] = dataframe.abs_change.rolling(window=hist2).mean()
    dataframe['hist_change_3'] = dataframe.abs_change.rolling(window=hist3).mean()
    
    dataframe['avg_max_close'] = dataframe[['max_move', 'abs_change']].mean(axis=1)
    
    dataframe['hist_avgmax_1'] = dataframe.avg_max_close.rolling(window=hist1).mean()
    dataframe['hist_avgmax_2'] = dataframe.avg_max_close.rolling(window=hist2).mean()
    dataframe['hist_avgmax_3'] = dataframe.avg_max_close.rolling(window=hist3).mean()
    
    dataframe['hist_volume_1'] = dataframe.Volume.rolling(window=volume1).mean()
    dataframe['hist_volume_2'] = dataframe.Volume.rolling(window=volume2).mean()
    
    dataframe['backtothefuture'] = dataframe.avg_max_close.rolling(window=forward_vol_period).mean()
    
    dataframe['forward_avg_close_max'] = dataframe['backtothefuture'].shift(-forward_vol_period)
    
    
    
    
    
    
    # Label is 1 when the forward average exceeds the threshold, else 0 (NaN compares as False)
    dataframe['volatile'] = (dataframe['forward_avg_close_max'] > question_vol + cushion).astype(int)
    #select columns
    the_columns = ['hist_avgmax_1', 'hist_avgmax_2', 'hist_avgmax_3', 'hist_volume_1', 'hist_volume_2', 
               'forward_avg_close_max', 'volatile', 'Open Interest' ]
    dataframe = dataframe[the_columns]
    
    return dataframe 
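The forward-looking column in `process` is built by taking a trailing rolling mean and shifting it back by the window length. A toy series makes the alignment easier to see; the values and the window of 2 are illustrative only.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
n = 2  # stands in for forward_vol_period

trailing = s.rolling(window=n).mean()   # row j averages rows j-n+1 .. j
forward = trailing.shift(-n)            # row i now holds the average of rows i+1 .. i+n

print(pd.DataFrame({'value': s, 'trailing': trailing, 'forward': forward}))
```

Row 0 ends up with the mean of rows 1 and 2, so the current day is excluded from its own forward window, and the last `n` rows have no forward value at all.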

In [16]:
#set and process dataset 
new_data = process(data,hist_period1 ,hist_period2 ,hist_period3 ,volume_period1,volume_period2,forward_vol_period, rate, 0)

In [17]:
new_data.tail()

Unnamed: 0,hist_avgmax_1,hist_avgmax_2,hist_avgmax_3,hist_volume_1,hist_volume_2,forward_avg_close_max,volatile,Open Interest
2044,0.002893,0.002587,0.002667,167597.5,177291.0,,0,612444
2045,0.003292,0.002744,0.002778,183868.5,180334.75,,0,618413
2046,0.003462,0.002761,0.002781,195849.5,181723.5,,0,622327
2047,0.002996,0.002711,0.002616,181372.0,182620.25,,0,624300
2048,0.002825,0.002611,0.00262,181384.0,188616.75,,0,625400


In [18]:
new_data['forward_avg_close_max'].describe()

count    2029.000000
mean        0.004897
std         0.001550
min         0.001984
25%         0.003874
50%         0.004709
75%         0.005720
max         0.010881
Name: forward_avg_close_max, dtype: float64

In [19]:
new_data['volatile'].value_counts()

1    1592
0     457
Name: volatile, dtype: int64
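These counts add up to 2049 rows, while `forward_avg_close_max` has only 2029 non-null values, so the 20 most recent rows are counted as 0 even though their future is unknown. The `dropna` in the next cell removes them before training, but if the counts themselves matter, the label can be masked first. A small sketch on toy values:

```python
import numpy as np
import pandas as pd

forward = pd.Series([0.006, 0.004, np.nan, np.nan])  # toy forward averages
threshold = 0.005

naive = (forward > threshold).astype(int)   # NaN rows silently become 0
masked = naive.where(forward.notna())       # keep NaN where the forward window is incomplete

print(naive.value_counts())
print(masked.value_counts())                # NaN rows are excluded from the counts
```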

In [20]:
new_data = new_data.dropna(how = 'any')

In [21]:
new_data.head(20)

Unnamed: 0,hist_avgmax_1,hist_avgmax_2,hist_avgmax_3,hist_volume_1,hist_volume_2,forward_avg_close_max,volatile,Open Interest
29,0.006082,0.006306,0.006827,246237.0,262138.25,0.006454,1,291060
30,0.005789,0.006039,0.006991,302346.0,265401.5,0.006534,1,299085
31,0.006128,0.005926,0.006857,340869.0,293553.0,0.006541,1,294462
32,0.006086,0.005946,0.006537,296481.0,299413.5,0.006696,1,295500
33,0.006341,0.005935,0.00662,309263.5,325066.25,0.006417,1,287525
34,0.005432,0.005872,0.006544,311980.5,304230.75,0.006502,1,287495
35,0.006105,0.00599,0.006701,280398.5,294831.0,0.006251,1,280074
36,0.00672,0.00627,0.006746,293985.0,302982.75,0.006104,1,276830
37,0.006346,0.006135,0.006576,263755.5,272077.0,0.006158,1,272859
38,0.006295,0.005986,0.006263,257792.0,275888.5,0.006001,1,271282


In [22]:
# Store the 'volatile' label in a new variable and remove it from the dataset
outcomes = new_data['volatile']

del new_data['volatile']
del new_data['forward_avg_close_max']

# 'Symbol' is already excluded by the column selection in process()
features = new_data

# Show the feature set with 'volatile' removed
features.head()

Unnamed: 0,hist_avgmax_1,hist_avgmax_2,hist_avgmax_3,hist_volume_1,hist_volume_2,Open Interest
29,0.006082,0.006306,0.006827,246237.0,262138.25,291060
30,0.005789,0.006039,0.006991,302346.0,265401.5,299085
31,0.006128,0.005926,0.006857,340869.0,293553.0,294462
32,0.006086,0.005946,0.006537,296481.0,299413.5,295500
33,0.006341,0.005935,0.00662,309263.5,325066.25,287525


In [23]:
X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.30, random_state=42)
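Each label here is built from a 20-day forward window, so neighbouring rows share most of their window, and a shuffled split can place nearly identical rows in both the training and the test set. One alternative, sketched below on synthetic data and assuming the rows are in date order, is to split chronologically with `shuffle=False`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the notebook's features and outcomes
features = pd.DataFrame({'x': np.arange(10)})
outcomes = pd.Series(np.arange(10) % 2)

# shuffle=False keeps row order: the first 70% train, the last 30% test
X_train, X_test, y_train, y_test = train_test_split(
    features, outcomes, test_size=0.30, shuffle=False)

print(X_train.index.tolist(), X_test.index.tolist())
```

Scores from a chronological split may come out lower than the shuffled ones, since the test rows no longer overlap with training rows in time.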

In [24]:
# Define the classifier and fit it to the data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [25]:
# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 1.0
The test accuracy is 0.9166666666666666


In [26]:
# Training the model
model = DecisionTreeClassifier(max_depth=15, min_samples_leaf=20, min_samples_split=20)
model.fit(X_train, y_train)

# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculating accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 0.9178571428571428
The test accuracy is 0.8716666666666667
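Since roughly 78% of the rows are labelled volatile (1592 of 2049 above), accuracy is easier to judge next to a majority-class baseline. The sketch below shows one way to compute such a baseline with scikit-learn's `DummyClassifier`; the labels are toy values, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

y = np.array([1, 1, 1, 0, 1, 1, 0, 1])   # toy labels, 6 of 8 are the majority class
X = np.zeros((len(y), 1))                # features are ignored by this baseline

baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
baseline_accuracy = accuracy_score(y, baseline.predict(X))
print('Baseline accuracy is', baseline_accuracy)
```

In the notebook itself, the same idea would mean fitting the dummy model on `X_train, y_train` and scoring it on `X_test, y_test`.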


In [27]:
def process_sample(dataframe, hist1, hist2, hist3, volume1, volume2):
    
    dataframe['abs_change'] = dataframe['Change'].abs() / dataframe['Close']
    
    dataframe['high_move'] = (((dataframe.High - dataframe.Close.shift(1))/dataframe['Close']).abs())
    dataframe['low_move'] = (((dataframe.Low - dataframe.Close.shift(1))/dataframe['Close']).abs())  # use the function argument, not the global `data`
    dataframe['max_move'] = dataframe[['high_move', 'low_move']].max(axis=1)
    
    dataframe['hist_max_1'] = dataframe.max_move.rolling(window=hist1).mean()
    dataframe['hist_max_2'] = dataframe.max_move.rolling(window=hist2).mean()
    dataframe['hist_max_3'] = dataframe.max_move.rolling(window=hist3).mean()
    
    dataframe['hist_change_1'] = dataframe.abs_change.rolling(window=hist1).mean()
    dataframe['hist_change_2'] = dataframe.abs_change.rolling(window=hist2).mean()
    dataframe['hist_change_3'] = dataframe.abs_change.rolling(window=hist3).mean()
    
    dataframe['avg_max_close'] = dataframe[['max_move', 'abs_change']].mean(axis=1)
    
    dataframe['hist_avgmax_1'] = dataframe.avg_max_close.rolling(window=hist1).mean()
    dataframe['hist_avgmax_2'] = dataframe.avg_max_close.rolling(window=hist2).mean()
    dataframe['hist_avgmax_3'] = dataframe.avg_max_close.rolling(window=hist3).mean()
    
    dataframe['hist_volume_1'] = dataframe.Volume.rolling(window=volume1).mean()
    dataframe['hist_volume_2'] = dataframe.Volume.rolling(window=volume2).mean()
    

    # keep only the feature columns
    the_columns = ['hist_avgmax_1', 'hist_avgmax_2', 'hist_avgmax_3', 'hist_volume_1', 'hist_volume_2', 
                   'Open Interest' ]
    dataframe = dataframe[the_columns]
    
    dataframe = dataframe.dropna(how = 'any')
    
    #scale the dataframe
    scaler = StandardScaler().fit(dataframe)
    new_df = scaler.transform(dataframe)
    
    
    return new_df 

In [28]:
og.head()

Unnamed: 0,Date Time,Symbol,Open,High,Low,Close,Change,Volume,Open Interest
0,1/3/2012,E6H12,1.3047,1.3085,1.302,1.3063,0.0,137219,282790
1,1/4/2012,E6H12,1.3059,1.3063,1.2904,1.2944,-0.0119,216177,293235
2,1/5/2012,E6H12,1.2948,1.2952,1.2777,1.2792,-0.0152,266566,298476
3,1/6/2012,E6H12,1.2798,1.282,1.2703,1.2729,-0.0063,227315,293406
4,1/9/2012,E6H12,1.27,1.2791,1.2673,1.2761,0.0032,187019,291209


In [29]:
new_og = process_sample(og, hist_period1 ,hist_period2 ,hist_period3 , volume_period1, volume_period2)

In [30]:
new_og

array([[ 0.6948196 ,  0.91687864,  1.30789065,  0.14433588,  0.36239837,
        -0.97276114],
       [ 0.5254919 ,  0.74415545,  1.41960289,  0.74936816,  0.40282767,
        -0.90084381],
       [ 0.72133335,  0.67077212,  1.32815434,  1.1647678 ,  0.75160432,
        -0.94227357],
       ...,
       [-0.82032542, -1.37497566, -1.43800904, -0.39900056, -0.63388204,
         1.99594164],
       [-1.08942604, -1.40734506, -1.55024655, -0.55511375, -0.62277195,
         2.013623  ],
       [-1.18835934, -1.47238368, -1.54767081, -0.55498435, -0.54847967,
         2.02348083]])

In [31]:
new_og[-1]

array([-1.18835934, -1.47238368, -1.54767081, -0.55498435, -0.54847967,
        2.02348083])

In [32]:
# delete all non-floats

#del new_og['Date Time']
#del new_og['Symbol']


In [33]:
new_og[-5:]  # new_og is a NumPy array, so slice it instead of calling .tail()

In [34]:
last_date = new_og[-1]
last_date

array([-1.18835934, -1.47238368, -1.54767081, -0.55498435, -0.54847967,
        2.02348083])

In [35]:
#one_sample = [new_og.iloc[last_date,:]] 

In [36]:
model.predict([last_date])

array([0], dtype=int64)
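Note that the model from cell 26 was fit on unscaled features, while `last_date` comes from `process_sample`, which standardizes its output. One way to keep training and prediction on the same scale is to put the scaler and the tree in a single pipeline. The sketch below uses synthetic data and is only an illustration of that pattern, not the notebook's method.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=10.0, size=(200, 3))   # synthetic, unscaled features
y = (X[:, 0] > 100.0).astype(int)                       # synthetic label

pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
pipe.fit(X[:-1], y[:-1])                 # the scaler is fit on the training rows only

last_row = X[-1]                         # raw, unscaled most recent row
prediction = pipe.predict([last_row])    # the pipeline scales the row before predicting
```

Scaling does not change how a decision tree splits, so the pipeline mainly guards against passing the model inputs on a different scale from the one it was trained on.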

In [37]:
def tail_vol(period, dataframe):
    rate = dataframe['avg_max_close'].tail(period).mean()
    vol = round((rate*100*16),2)
    
    return vol
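`tail_vol` expects a DataFrame that still has an `avg_max_close` column, so it cannot be called on the NumPy array returned by `process_sample`. A toy frame shows the intended use; the values are made up for the example.

```python
import pandas as pd

def tail_vol(period, dataframe):
    # Mean of the last `period` daily rates, expressed as an annualized percent
    rate = dataframe['avg_max_close'].tail(period).mean()
    return round(rate * 100 * 16, 2)

toy = pd.DataFrame({'avg_max_close': [0.010, 0.004, 0.006]})
recent_vol = tail_vol(2, toy)   # the last two rows average 0.005
print(recent_vol)
```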

In [38]:
#tail_vol(20, new_og)

In [39]:
one_sample = [last_date]  # wrap the single row in a list so predict() receives a 2D input
answer = model.predict(one_sample)
answer = answer[0]
answer

In [40]:
experiment_df = pd.read_csv(file)

experiment = process(experiment_df,hist_period1 ,hist_period2 ,hist_period3 ,volume_period1,volume_period2,
                        forward_vol_period, rate, 0)

In [41]:
len(experiment)

2049

In [42]:
pd.set_option('display.max_columns', 999)

In [43]:
experiment.head()

Unnamed: 0,hist_avgmax_1,hist_avgmax_2,hist_avgmax_3,hist_volume_1,hist_volume_2,forward_avg_close_max,volatile,Open Interest
0,,,,,,0.007592,1,282790
1,,,,176698.0,,0.007221,1,293235
2,,,,241371.5,,0.006763,1,298476
3,,,,246940.5,211819.25,0.00676,1,293406
4,,,,207167.0,224269.25,0.0071,1,291209


In [44]:
#select columns to keep, create list to reduce variables
the_columns = ['hist_avgmax_1', 'hist_avgmax_2', 'hist_avgmax_3', 'hist_volume_1', 'hist_volume_2', 
               'forward_avg_close_max', 'volatile', 'Open Interest' ]

In [45]:
experiment = experiment[the_columns]

In [46]:
experiment.describe()

Unnamed: 0,hist_avgmax_1,hist_avgmax_2,hist_avgmax_3,hist_volume_1,hist_volume_2,forward_avg_close_max,volatile,Open Interest
count,2040.0,2030.0,2020.0,2048.0,2046.0,2029.0,2049.0,2049.0
mean,0.004901,0.004898,0.0049,233645.528076,233683.103006,0.004897,0.776964,398181.360176
std,0.001736,0.00155,0.001474,92562.260239,80650.400544,0.00155,0.416384,111463.397627
min,0.001721,0.001984,0.002096,30277.5,69814.5,0.001984,0.0,185313.0
25%,0.003739,0.003875,0.003921,172799.375,178572.875,0.003874,1.0,300962.0
50%,0.00463,0.00471,0.004711,217409.0,221077.875,0.004709,1.0,408853.0
75%,0.005753,0.00572,0.005683,268583.75,268934.625,0.00572,1.0,492662.0
max,0.014023,0.010881,0.010173,831893.0,644397.75,0.010881,1.0,666250.0


In [47]:
def find_vol(dataframe, start_vol, end_vol, step, hist_period1, hist_period2, hist_period3, 
             volume_period1, volume_period2, forward_vol_period):
    vol = start_vol
    while vol < end_vol: 
        answer = 1 
        rate = vol_convert(vol)
        beg_frame = dataframe.copy()
        
        frame = process(dataframe,hist_period1 ,hist_period2 ,hist_period3 ,volume_period1,volume_period2,
                        forward_vol_period, rate, 0)
   
    
        frame = frame.dropna(how = 'any')
    
        outcomes = frame['volatile']
        features = frame

        del features['volatile']
        del features['forward_avg_close_max']
        
        #scale the data
        scaler = StandardScaler().fit(features)
        rescaledX = scaler.transform(features)
        
        #split
        X_train, X_test, y_train, y_test = train_test_split(rescaledX, outcomes, test_size=0.30, random_state=42)
        # Training the model
        model = DecisionTreeClassifier(max_depth=15, min_samples_leaf=20, min_samples_split=20)
        model.fit(X_train, y_train)

        # Making predictions
        y_train_pred = model.predict(X_train)
        y_test_pred = model.predict(X_test)

        # Calculating accuracies
        train_accuracy = accuracy_score(y_train, y_train_pred)
        test_accuracy = accuracy_score(y_test, y_test_pred)
        precision = precision_score(y_test, y_test_pred)
        recall = recall_score(y_test, y_test_pred)
        the_f1 = f1_score(y_test, y_test_pred)
        
   
    
        original_data = process_sample(beg_frame, hist_period1 ,hist_period2 ,hist_period3 , volume_period1, volume_period2)
        #original_data = original_data.dropna(how = 'any')
        
        #del original_data['Date Time']
        last_row = original_data[-1]
        one_sample = last_row
    
        answer = model.predict([one_sample])
        answer = answer[0]
        
        #send feature columns to a list
        cols = features.columns.tolist()
        
        
        print(vol)
        print(answer)
        print('The training accuracy is', train_accuracy)
        print('The test accuracy is', test_accuracy)
        print('The precision is', precision)
        print('The recall is', recall)
        print('The F1 is', the_f1 )
        
        
        print('The number of days', len(outcomes))
        print('The percentage of volatile days', ((outcomes == 1).sum()) / len(outcomes))
        
        print('The feature columns are', cols)
        
        
        
        #print(frame.head(1))
    
        #if answer == 0:
            #break

        vol = vol + step
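Adding `step` repeatedly accumulates floating-point error, which is why a level prints as `4.199999999999999` in the output below. A possible alternative is to build the list of levels up front by multiplication and rounding. `vol_grid` is a hypothetical helper, not part of the notebook.

```python
def vol_grid(start_vol, end_vol, step):
    """Hypothetical helper: levels from start_vol up to, but not including, end_vol."""
    n_steps = int(round((end_vol - start_vol) / step))
    # Multiply instead of accumulating, then round away residual float noise
    return [round(start_vol + i * step, 10) for i in range(n_steps)]

grid = vol_grid(4, 5.5, 0.10)
print(grid)
```

The `while` loop in `find_vol` could then become `for vol in vol_grid(start_vol, end_vol, step):`, which also removes the manual increment.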
    


In [60]:
fv_dataframe = pd.read_csv(file)
vol = 4.75
start_vol = 4
end_vol = 5.5
step = .10
hist_period1 = 5
hist_period2 = 22
hist_period3 = 33
volume_period1 = 4
volume_period2 = 8
forward_vol_period = 33
rate = vol_convert(vol)

In [61]:
check = process(fv_dataframe,hist_period1 ,hist_period2 ,hist_period3 ,volume_period1,volume_period2,
                        forward_vol_period, rate, 0)

In [62]:
check.head()

Unnamed: 0,hist_avgmax_1,hist_avgmax_2,hist_avgmax_3,hist_volume_1,hist_volume_2,forward_avg_close_max,volatile,Open Interest
0,,,,,,0.006902,1,282790
1,,,,,,0.006619,1,293235
2,,,,,,0.006499,1,298476
3,,,,211819.25,,0.006629,1,293406
4,0.006572,,,224269.25,,0.006693,1,291209


In [63]:
find_vol(fv_dataframe, start_vol, end_vol, step, hist_period1, hist_period2, hist_period3, 
         volume_period1, volume_period2, forward_vol_period)

4
1
The training accuracy is 0.9906340057636888
The test accuracy is 0.9865771812080537
The precision is 0.9880546075085325
The recall is 0.9982758620689656
The F1 is 0.993138936535163
The number of days 1984
The percentage of volatile days 0.9783266129032258
The feature columns are ['hist_avgmax_1', 'hist_avgmax_2', 'hist_avgmax_3', 'hist_volume_1', 'hist_volume_2', 'Open Interest']
4.1
1
The training accuracy is 0.984149855907781
The test accuracy is 0.9798657718120806
The precision is 0.9812286689419796
The recall is 0.9982638888888888
The F1 is 0.9896729776247848
The number of days 1984
The percentage of volatile days 0.9722782258064516
The feature columns are ['hist_avgmax_1', 'hist_avgmax_2', 'hist_avgmax_3', 'hist_volume_1', 'hist_volume_2', 'Open Interest']
4.199999999999999
1
The training accuracy is 0.973342939481268
The test accuracy is 0.964765100671141
The precision is 0.9706896551724138
The recall is 0.9929453262786596
The F1 is 0.9816913687881429
The number of days 198