### Statistical Inference of Hidden Markov Models on High Frequency Quote Data

Benchmarking the statistical inference performance of PSG against HMMLearn.

References:

PSG: http://aorda.com/html/PSG_Help_HTML/index.html?hmm_normal.htm

HMMLearn: https://hmmlearn.readthedocs.io/en/latest/index.html



In [155]:
import pandas as pd
import numpy as np
from hmmlearn.hmm import GaussianHMM
from sklearn.model_selection import TimeSeriesSplit
import matplotlib.pyplot as plt

import os
# raw string so the backslashes in the Windows path are not read as escape sequences
os.add_dll_directory(r'C:\Aorda\PSG\lib')
import psgpython as psg 
from psg_loader import load_psg

load_psg()

Load the features and remove consecutive repeated values from each feature series.


In [159]:
features=pd.read_csv('data/features.csv',index_col=0,nrows=5000)

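def remove_duplicates(series):
    """Drop every value that repeats the value immediately before it."""
    # np.diff is non-zero wherever a value differs from its predecessor;
    # inserting True at position 0 always keeps the first element
    cleaned_series=series[np.insert(np.diff(series).astype(bool), 0, True)]
    dropped_els=len(series)-len(cleaned_series)

    print(f"Dropped {dropped_els} of original {len(series)} consecutive repeated values from input series")
    return cleaned_series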

bidsize=remove_duplicates(features['Bid_Size'].values)
offersize=remove_duplicates(features['Offer_Size'].values)
bookimbalance=remove_duplicates(features['OB_IB'].values)
spread=remove_duplicates(features['spread'].values)

# save each series as a text file of floats, one vector per feature, for PSG to read
np.savetxt(r'psg_text_hmm/vector_bidsize.txt', bidsize)
np.savetxt(r'psg_text_hmm/vector_offersize.txt', offersize)
np.savetxt(r'psg_text_hmm/vector_bookimbalance.txt', bookimbalance)
np.savetxt(r'psg_text_hmm/vector_spread.txt', spread)



Dropped 2260 of original 5000 consecutive repeated values from input series
Dropped 2362 of original 5000 consecutive repeated values from input series
Dropped 1396 of original 5000 consecutive repeated values from input series
Dropped 1020 of original 5000 consecutive repeated values from input series
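
To make the filtering step concrete, here is a minimal sketch of the same `np.diff` mask applied to a toy array. The array is hypothetical and not taken from `features.csv`.

```python
import numpy as np

# Hypothetical toy series containing runs of repeated values.
toy = np.array([1.0, 1.0, 2.0, 2.0, 2.0, 1.0, 3.0, 3.0])

# np.diff is non-zero exactly where a value differs from its predecessor;
# inserting True at index 0 always keeps the first element.
keep = np.insert(np.diff(toy).astype(bool), 0, True)
deduped = toy[keep]

print(keep)
print(deduped)  # [1. 2. 1. 3.]
```

Only consecutive repeats are removed, so the value 1.0 survives twice. Since each feature is filtered independently, the four saved vectors have different lengths (2740, 2638, 3604 and 3980 values) and are no longer aligned row by row.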


### Steps

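- Train an HMM on one feature at a time.
- Assume each feature is generated by two hidden states, each of which emits observations from its own normal distribution.
- Learn the optimal parameters of the hidden states: the initial and transition probabilities, and the mean and dispersion of each normal distribution.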
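
The modelling assumption in the steps above can be sketched by sampling from a two-state Gaussian HMM. The function name `sample_gaussian_hmm` and all parameter values below are illustrative assumptions, not values fitted in this notebook.

```python
import numpy as np

def sample_gaussian_hmm(n, p, A, mu, sigma, seed=0):
    """Draw n observations from an HMM whose states emit normal variates.

    p: initial state probabilities, A: row-stochastic transition matrix,
    mu / sigma: per-state mean and standard deviation.
    """
    rng = np.random.default_rng(seed)
    states = np.empty(n, dtype=int)
    obs = np.empty(n)
    for t in range(n):
        # the first state comes from p, later states from the row of A
        # that belongs to the previous state
        probs = p if t == 0 else A[states[t - 1]]
        states[t] = rng.choice(len(p), p=probs)
        obs[t] = rng.normal(mu[states[t]], sigma[states[t]])
    return states, obs

# Illustrative parameters only (a tight "calm" state and a wide "active" state).
p = np.array([1.0, 0.0])
A = np.array([[0.9, 0.1],
              [0.7, 0.3]])
mu = np.array([0.001, 0.03])
sigma = np.array([0.0005, 0.05])

states, obs = sample_gaussian_hmm(1000, p, A, mu, sigma)
print(states[:10], obs[:3])
```

Fitting reverses this process: given only `obs`, recover `p`, `A`, `mu` and `sigma`.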

### PSG

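The HMM_Normal optimization routine is applied to each feature. Each problem maximizes `hmm_normal(2, vector_<feature>)` with the VAN solver, subject to the state probabilities summing to one.

The initial point is estimated via the Baum-Welch algorithm.

Inference is then completed via constrained optimization.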
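
As a rough illustration of the quantity being maximized, the sketch below computes the log-likelihood of a two-state Gaussian HMM with the scaled forward algorithm. It assumes that `hmm_normal` is a log-likelihood of this form; that is an assumption about PSG, and this is not PSG's implementation. The helper names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Normal density of x under each state's mean and standard deviation."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def hmm_log_likelihood(obs, p, A, mu, sigma):
    """Scaled forward algorithm: returns log P(obs | p, A, mu, sigma)."""
    loglik = 0.0
    alpha = p * gaussian_pdf(obs[0], mu, sigma)
    for t in range(len(obs)):
        if t > 0:
            # propagate one step through the transition matrix, then weight
            # by the emission density of the new observation
            alpha = (alpha @ A) * gaussian_pdf(obs[t], mu, sigma)
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha = alpha / scale  # renormalize to avoid numerical underflow
    return loglik

# Illustrative inputs only.
toy_obs = np.array([0.0010, 0.0012, 0.0400, 0.0011])
toy_ll = hmm_log_likelihood(
    toy_obs,
    p=np.array([1.0, 0.0]),
    A=np.array([[0.9, 0.1], [0.7, 0.3]]),
    mu=np.array([0.001, 0.03]),
    sigma=np.array([0.0005, 0.05]),
)
print(toy_ll)
```

When the standard deviations are far below 1, the densities exceed 1 and the log-likelihood can be positive, which is consistent with the large positive objective values reported in the solver logs.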



### Spread

In [161]:
psg_spread_prob = psg.psg_importfromtext('./psg_text_hmm/problem_hmm_normal_spread.txt')
psg_spread_prob['problem_statement'] = '\n'.join(psg_spread_prob['problem_statement'])
spread_solution=psg.psg_solver(psg_spread_prob)
spread_solution.values()

OK. Problem Imported

Running solver
Reading problem formulation
Asking for data information
Getting data
    100.0% of scenarios is processed
100% of vector_spread was read
Start optimization
Ext.iteration=0  Objective=0.854976334224E+00  Residual=0.000000000000E+00
Ext.iteration=10  Objective=0.854976334224E+00  Residual=0.000000000000E+00
Optimization is stopped
Solution is optimal
Calculating resulting outputs. Writing solution.
Objective: objective = 20309.0557089 [-2.474368648186E+16]
Solver has normally finished. Solution was saved.
Problem: problem_hmm_normal, solution_status = optimal
Timing: data_loading_time = 0.11, preprocessing_time = 2.72, solving_time = 0.07
Variables: optimal_point = point_problem_hmm_normal
Objective: objective = 20309.0557089 [-2.474368648186E+16]
Constraint: sum_of_probabilities_for_states = vector_sum_of_probabilities_for_states
Function: hmm_normal(2,vector_spread) =  2.030905570892E+04
OK. Solver Finished



dict_values(['problem_hmm_normal', 'optimal', ['problem_HMM_Normal, maximize', '  hmm_normal(2,vector_spread)', '  Solver: VAN , precision=9, stages=10'], ['Problem: problem_hmm_normal, solution_status = optimal', 'Timing: data_loading_time = 0.11, preprocessing_time = 2.72, solving_time = 0.07', 'Variables: optimal_point = point_problem_hmm_normal', 'Objective: objective = 20309.0557089 [-2.474368648186E+16]', 'Constraint: sum_of_probabilities_for_states = vector_sum_of_probabilities_for_states', 'Function: hmm_normal(2,vector_spread) =  2.030905570892E+04'], [['p1', 'p2', 'a1_1', 'a1_2', 'a2_1', 'a2_2', 'mu1', 'si1', 'mu2', 'si2'], array([0.00000000e+00, 1.00000000e+00, 8.84131876e-01, 1.15868124e-01,
       7.21213114e-01, 2.78786886e-01, 1.09959669e-03, 5.39832179e-04,
       2.76887706e-02, 5.40693555e-02])], array([2., 1., 2., ..., 1., 1., 1.]), array([1., 1., 1.]), array([6.49480469e-14, 2.88657986e-15, 0.00000000e+00]), [['state1', 'state2'], array([[0.00000000e+00, 1.00000000e

### Book Imbalance

In [162]:
psg_bookimbalance_prob = psg.psg_importfromtext('./psg_text_hmm/problem_hmm_normal_bookimbalance.txt')
psg_bookimbalance_prob['problem_statement'] = '\n'.join(psg_bookimbalance_prob['problem_statement'])
bookimbalance_solution=psg.psg_solver(psg_bookimbalance_prob)
bookimbalance_solution.values()

OK. Problem Imported

Running solver
Reading problem formulation
Asking for data information
Getting data
    100.0% of scenarios is processed
100% of vector_bookimbalance was read
Start optimization
Ext.iteration=0  Objective=0.949324250930E+00  Residual=0.000000000000E+00
Ext.iteration=10  Objective=0.949324250930E+00  Residual=0.000000000000E+00
Optimization is stopped
Solution is optimal
Calculating resulting outputs. Writing solution.
Objective: objective = 21466.7161771 [-2.355482087659E+16]
Solver has normally finished. Solution was saved.
Problem: problem_hmm_normal, solution_status = optimal
Timing: data_loading_time = 0.08, preprocessing_time = 1.82, solving_time = 0.05
Variables: optimal_point = point_problem_hmm_normal
Objective: objective = 21466.7161771 [-2.355482087659E+16]
Constraint: sum_of_probabilities_for_states = vector_sum_of_probabilities_for_states
Function: hmm_normal(2,vector_bookimbalance) =  2.146671617708E+04
OK. Solver Finished



dict_values(['problem_hmm_normal', 'optimal', ['problem_HMM_Normal, maximize', '  hmm_normal(2,vector_bookimbalance)', '  Solver: VAN , precision=9, stages=10'], ['Problem: problem_hmm_normal, solution_status = optimal', 'Timing: data_loading_time = 0.08, preprocessing_time = 1.82, solving_time = 0.05', 'Variables: optimal_point = point_problem_hmm_normal', 'Objective: objective = 21466.7161771 [-2.355482087659E+16]', 'Constraint: sum_of_probabilities_for_states = vector_sum_of_probabilities_for_states', 'Function: hmm_normal(2,vector_bookimbalance) =  2.146671617708E+04'], [['p1', 'p2', 'a1_1', 'a1_2', 'a2_1', 'a2_2', 'mu1', 'si1', 'mu2', 'si2'], array([1.00000000e+00, 0.00000000e+00, 9.49108797e-01, 5.08912029e-02,
       7.85060430e-01, 2.14939570e-01, 5.73208114e-04, 4.05523971e-04,
       1.62814926e-02, 2.03592096e-02])], array([1., 1., 1., ..., 1., 1., 1.]), array([1., 1., 1.]), array([0.00000000e+00, 4.88498131e-15, 3.55271368e-15]), [['state1', 'state2'], array([[1.00000000e+0

### Offer Size

In [163]:
psg_offersize_prob = psg.psg_importfromtext('./psg_text_hmm/problem_hmm_normal_offersize.txt')
psg_offersize_prob['problem_statement'] = '\n'.join(psg_offersize_prob['problem_statement'])
offersize_solution=psg.psg_solver(psg_offersize_prob)
offersize_solution.values()

OK. Problem Imported

Running solver
Reading problem formulation
Asking for data information
Getting data
    100.0% of scenarios is processed
100% of vector_offersize was read
Start optimization
Ext.iteration=0  Objective=0.770200568914E+00  Residual=0.000000000000E+00
Ext.iteration=10  Objective=0.770200568914E+00  Residual=0.000000000000E+00
Optimization is stopped
Solution is optimal
Calculating resulting outputs. Writing solution.
Objective: objective = 14610.5883147 [-1.976025913521E+16]
Solver has normally finished. Solution was saved.
Problem: problem_hmm_normal, solution_status = optimal
Timing: data_loading_time = 0.16, preprocessing_time = 1.43, solving_time = 0.04
Variables: optimal_point = point_problem_hmm_normal
Objective: objective = 14610.5883147 [-1.976025913521E+16]
Constraint: sum_of_probabilities_for_states = vector_sum_of_probabilities_for_states
Function: hmm_normal(2,vector_offersize) =  1.461058831470E+04
OK. Solver Finished



dict_values(['problem_hmm_normal', 'optimal', ['problem_HMM_Normal, maximize', '  hmm_normal(2,vector_offersize)', '  Solver: VAN , precision=9, stages=10'], ['Problem: problem_hmm_normal, solution_status = optimal', 'Timing: data_loading_time = 0.16, preprocessing_time = 1.43, solving_time = 0.04', 'Variables: optimal_point = point_problem_hmm_normal', 'Objective: objective = 14610.5883147 [-1.976025913521E+16]', 'Constraint: sum_of_probabilities_for_states = vector_sum_of_probabilities_for_states', 'Function: hmm_normal(2,vector_offersize) =  1.461058831470E+04'], [['p1', 'p2', 'a1_1', 'a1_2', 'a2_1', 'a2_2', 'mu1', 'si1', 'mu2', 'si2'], array([1.00000000e+00, 0.00000000e+00, 9.32063182e-01, 6.79368183e-02,
       6.86041627e-01, 3.13958373e-01, 5.11936157e-04, 5.34551600e-04,
       1.87019261e-02, 1.96073010e-02])], array([1., 1., 1., ..., 1., 1., 1.]), array([1., 1., 1.]), array([0.00000000e+00, 4.88498131e-15, 4.66293670e-15]), [['state1', 'state2'], array([[1.00000000e+00, 0.000

### Bid Size

In [164]:
psg_bidsize_prob = psg.psg_importfromtext('./psg_text_hmm/problem_hmm_normal_bidsize.txt')
psg_bidsize_prob['problem_statement'] = '\n'.join(psg_bidsize_prob['problem_statement'])
bidsize_solution=psg.psg_solver(psg_bidsize_prob)
bidsize_solution.values()

OK. Problem Imported

Running solver
Reading problem formulation
Asking for data information
Getting data
    100.0% of scenarios is processed
100% of vector_bidsize was read
Start optimization
Ext.iteration=0  Objective=0.661743463934E+00  Residual=0.000000000000E+00
Ext.iteration=10  Objective=0.661743463934E+00  Residual=0.000000000000E+00
Optimization is stopped
Solution is optimal
Calculating resulting outputs. Writing solution.
Objective: objective = 9707.11338643 [-1.528020599415E+16]
Solver has normally finished. Solution was saved.
Problem: problem_hmm_normal, solution_status = optimal
Timing: data_loading_time = 0.09, preprocessing_time = 1.64, solving_time = 0.03
Variables: optimal_point = point_problem_hmm_normal
Objective: objective = 9707.11338643 [-1.528020599415E+16]
Constraint: sum_of_probabilities_for_states = vector_sum_of_probabilities_for_states
Function: hmm_normal(2,vector_bidsize) =  9.707113386427E+03
OK. Solver Finished



dict_values(['problem_hmm_normal', 'optimal', ['problem_HMM_Normal, maximize', '  hmm_normal(2,vector_bidsize)', '  Solver: VAN , precision=9, stages=10'], ['Problem: problem_hmm_normal, solution_status = optimal', 'Timing: data_loading_time = 0.09, preprocessing_time = 1.64, solving_time = 0.03', 'Variables: optimal_point = point_problem_hmm_normal', 'Objective: objective = 9707.11338643 [-1.528020599415E+16]', 'Constraint: sum_of_probabilities_for_states = vector_sum_of_probabilities_for_states', 'Function: hmm_normal(2,vector_bidsize) =  9.707113386427E+03'], [['p1', 'p2', 'a1_1', 'a1_2', 'a2_1', 'a2_2', 'mu1', 'si1', 'mu2', 'si2'], array([1.        , 0.        , 0.91938504, 0.08061496, 0.58222151,
       0.41777849, 0.00329771, 0.0035625 , 0.06192256, 0.09484753])], array([1., 1., 1., ..., 1., 1., 1.]), array([1., 1., 1.]), array([0.00000000e+00, 4.21884749e-15, 3.99680289e-15]), [['state1', 'state2'], array([[1.        , 0.        ],
       [0.99824468, 0.00175532],
       [0.9971
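
The flat parameter vector in each PSG solution can be reshaped into the same layout that HMMLearn reports. The sketch below uses the bid size values printed above. It assumes that `si1` and `si2` are standard deviations, so they are squared to obtain variances; that reading of the PSG names is an assumption.

```python
import numpy as np

# Names and values copied from the bid size solution printed above.
names = ['p1', 'p2', 'a1_1', 'a1_2', 'a2_1', 'a2_2', 'mu1', 'si1', 'mu2', 'si2']
values = np.array([1.0, 0.0, 0.91938504, 0.08061496, 0.58222151,
                   0.41777849, 0.00329771, 0.0035625, 0.06192256, 0.09484753])
point = dict(zip(names, values))

startprob = np.array([point['p1'], point['p2']])
transmat = np.array([[point['a1_1'], point['a1_2']],
                     [point['a2_1'], point['a2_2']]])
means = np.array([point['mu1'], point['mu2']])
# Assumption: si* are standard deviations, so square them to get variances.
variances = np.array([point['si1'], point['si2']]) ** 2

print("Transition matrix:", transmat.flatten())
print("Means:", means)
print("Variances:", variances)
```

Each row of `transmat` should sum to one, matching the `sum_of_probabilities_for_states` constraint in the solver log.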

### HMM Learn


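- Set 2 hidden components.
- Parameters are fit with the Baum-Welch (EM) algorithm, which relies on forward-backward passes; `algorithm='viterbi'` only selects the decoder used to recover state sequences.
- Full covariance matrix, with `min_covar` as a floor on the covariance to prevent overfitting.
- Tolerance set comparably to PSG (`tol=1e-8` against PSG's `precision=9`).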
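
In HMMLearn the `algorithm` argument chooses the decoder (`"viterbi"` or `"map"`) used by `decode` and `predict`, while `fit` estimates the parameters by EM. To show what Viterbi decoding does once parameters are known, here is a minimal NumPy sketch for a Gaussian HMM. The function name is hypothetical and the parameters are rounded, illustrative values, not the fitted models.

```python
import numpy as np

def viterbi_gaussian(obs, p, A, mu, sigma):
    """Most likely state path for a Gaussian HMM, computed in log space."""
    n, k = len(obs), len(p)
    with np.errstate(divide='ignore'):  # log(0) = -inf is intended here
        log_p, log_A = np.log(p), np.log(A)
    # log emission density of every observation under every state
    log_b = -0.5 * ((obs[:, None] - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    delta = np.empty((n, k))            # best log score of a path ending in each state
    back = np.zeros((n, k), dtype=int)  # best predecessor state
    delta[0] = log_p + log_b[0]
    for t in range(1, n):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_b[t]
    # trace the best path backwards from the best final state
    path = np.empty(n, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Illustrative parameters and observations only.
path = viterbi_gaussian(
    np.array([0.0010, 0.0012, 0.2000, 0.0010]),
    p=np.array([1.0, 0.0]),
    A=np.array([[0.91, 0.09], [0.77, 0.23]]),
    mu=np.array([0.0012, 0.035]),
    sigma=np.array([0.002, 0.06]),
)
print(path)
```

The large third observation is far more plausible under the wide second state, so the decoded path switches state for that step only.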

### Spread

In [175]:
spread_model=GaussianHMM(n_components=2,algorithm='viterbi',covariance_type="full",min_covar=1e-4, n_iter=1000,tol=1e-8, verbose=True)
fitted_spread_model=spread_model.fit(spread.reshape(-1, 1))

         1       10371.4744             +nan
         2       14507.2116       +4135.7372
         3       15345.2626        +838.0510
         4       16186.2642        +841.0016
         5       17076.7994        +890.5352
         6       17684.9248        +608.1254
         7       17949.3545        +264.4297
         8       18017.1622         +67.8077
         9       18031.9423         +14.7800
        10       18035.1381          +3.1958
        11       18035.8338          +0.6957
        12       18035.9859          +0.1521
        13       18036.0192          +0.0333
        14       18036.0266          +0.0073
        15       18036.0282          +0.0016
        16       18036.0285          +0.0004
        17       18036.0286          +0.0001
        18       18036.0286          +0.0000
        19       18036.0286          +0.0000
        20       18036.0286          +0.0000
        21       18036.0286          +0.0000
        22       18036.0286          +0.0000
        23

In [176]:
print(f"Transition Matrix is {fitted_spread_model.transmat_.flatten()}")
print(f"Mean Values are {fitted_spread_model.means_.flatten()}")
print(f"Covariance Matrix is {fitted_spread_model.covars_.flatten()}")

Transition Matrix is [0.9077427  0.0922573  0.77046555 0.22953445]
Mean Values are [0.00122037 0.03458228]
Covariance Matrix is [3.56215183e-06 3.60785320e-03]


### Book Imbalance

In [177]:
bookimbalance_model=GaussianHMM(n_components=2,algorithm='viterbi',covariance_type="full",min_covar=1e-4, n_iter=1000,tol=1e-8, verbose=True)
fitted_bookimbalance_model=bookimbalance_model.fit(bookimbalance.reshape(-1, 1))

         1        2323.1534             +nan
         2       18275.7645      +15952.6111
         3       18495.6910        +219.9265
         4       18516.7125         +21.0215
         5       18518.0513          +1.3388
         6       18518.1532          +0.1018
         7       18518.1989          +0.0457
         8       18518.2185          +0.0196
         9       18518.2254          +0.0069
        10       18518.2276          +0.0022
        11       18518.2283          +0.0007
        12       18518.2285          +0.0002
        13       18518.2285          +0.0001
        14       18518.2285          +0.0000
        15       18518.2285          +0.0000
        16       18518.2285          +0.0000
        17       18518.2285          +0.0000
        18       18518.2285          +0.0000
        19       18518.2285          +0.0000
        20       18518.2285          +0.0000
        21       18518.2285          +0.0000


In [180]:
print(f"Transition Matrix is {fitted_bookimbalance_model.transmat_.flatten()}")
print(f"Mean Values are {fitted_bookimbalance_model.means_.flatten()}")
print(f"Covariance Matrix is {fitted_bookimbalance_model.covars_.flatten()}")

Transition Matrix is [0.97733664 0.02266336 0.94103825 0.05896175]
Mean Values are [0.00067809 0.03688092]
Covariance Matrix is [3.35068608e-06 4.96496720e-04]


### Bid Size

In [181]:
bidsize_model=GaussianHMM(n_components=2,algorithm='viterbi',covariance_type="full",min_covar=1e-4, n_iter=1000,tol=1e-8, verbose=True)
fitted_bidsize_model=bidsize_model.fit(bidsize.reshape(-1, 1))

         1        -205.4457             +nan
         2        7965.9595       +8171.4052
         3        8771.3726        +805.4131
         4        9280.9350        +509.5625
         5        9468.3391        +187.4041
         6        9575.8131        +107.4740
         7        9617.8312         +42.0181
         8        9633.5069         +15.6757
         9        9640.7660          +7.2590
        10        9644.5897          +3.8237
        11        9646.7362          +2.1466
        12        9647.9830          +1.2468
        13        9648.7199          +0.7369
        14        9649.1589          +0.4390
        15        9649.4211          +0.2622
        16        9649.5778          +0.1567
        17        9649.6714          +0.0936
        18        9649.7272          +0.0558
        19        9649.7605          +0.0333
        20        9649.7803          +0.0198
        21        9649.7922          +0.0118
        22        9649.7992          +0.0070
        23

In [182]:
print(f"Transition Matrix is {fitted_bidsize_model.transmat_.flatten()}")
print(f"Mean Values are {fitted_bidsize_model.means_.flatten()}")
print(f"Covariance Matrix is {fitted_bidsize_model.covars_.flatten()}")

Transition Matrix is [0.93687424 0.06312576 0.52269516 0.47730484]
Mean Values are [0.00352137 0.0676133 ]
Covariance Matrix is [1.95619226e-05 9.90502150e-03]


### Offer Size

In [183]:
offersize_model=GaussianHMM(n_components=2,algorithm='viterbi',covariance_type="full",min_covar=1e-4, n_iter=1000,tol=1e-8, verbose=True)
fitted_offersize_model=offersize_model.fit(offersize.reshape(-1, 1))

         1        7492.4343             +nan
         2       11926.4027       +4433.9684
         3       12516.6284        +590.2257
         4       12623.2027        +106.5743
         5       12651.3003         +28.0976
         6       12660.4762          +9.1759
         7       12663.6569          +3.1807
         8       12664.8262          +1.1693
         9       12665.2836          +0.4573
        10       12665.4736          +0.1900
        11       12665.5567          +0.0831
        12       12665.5946          +0.0379
        13       12665.6123          +0.0177
        14       12665.6207          +0.0084
        15       12665.6248          +0.0041
        16       12665.6267          +0.0020
        17       12665.6277          +0.0010
        18       12665.6282          +0.0005
        19       12665.6284          +0.0002
        20       12665.6285          +0.0001
        21       12665.6286          +0.0001
        22       12665.6286          +0.0000
        23

In [184]:
print(f"Transition Matrix is {fitted_offersize_model.transmat_.flatten()}")
print(f"Mean Values are {fitted_offersize_model.means_.flatten()}")
print(f"Covariance Matrix is {fitted_offersize_model.covars_.flatten()}")

Transition Matrix is [0.96179163 0.03820837 0.76666404 0.23333596]
Mean Values are [0.0006806  0.03165246]
Covariance Matrix is [4.99663118e-06 4.52385708e-04]
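
For a side-by-side view, the objective values reported in the logs above can be tabulated as below. This sketch assumes that the PSG objective and the HMMLearn log-likelihood measure the same quantity on the same series, which the logs do not confirm. The HMMLearn figures are the last values visible in the truncated training logs, and the bid size log had not yet levelled off at that point.

```python
# Final PSG objective values copied from the solver logs above.
psg_objective = {
    'spread': 20309.0557089,
    'bookimbalance': 21466.7161771,
    'offersize': 14610.5883147,
    'bidsize': 9707.11338643,
}
# Last log-likelihood visible in each (truncated) HMMLearn training log.
hmmlearn_last_loglik = {
    'spread': 18036.0286,
    'bookimbalance': 18518.2285,
    'offersize': 12665.6286,
    'bidsize': 9649.7992,
}

gaps = {name: psg_objective[name] - hmmlearn_last_loglik[name] for name in psg_objective}

for name, gap in gaps.items():
    print(f"{name:>14}: PSG {psg_objective[name]:.4f}  "
          f"HMMLearn {hmmlearn_last_loglik[name]:.4f}  gap {gap:+.4f}")
```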
