# Introduction

> **Underlying Literature**: The following module was inspired by the ideas put forward in Chapter 19, Section 3 of [Advances in Financial Machine Learning](https://www.wiley.com/en-us/Advances+in+Financial+Machine+Learning-p-9781119482086) by Marcos Lopez de Prado.

In this notebook, we showcase how to properly use the functions belonging to the First Generation Models of our Microstructural Features module.

We will first apply the Tick Rule to our data set in order to classify each trade as either buy-initiated or sell-initiated. Then, we will show how simple it is to use our feature transformation functions to produce valuable features for an ML model. Finally, we will showcase how to generate all of the transformations stored in this module with a single function call, providing a lookback period so that the features are computed on a rolling basis.

### Table of Contents
- [The Tick Rule](#tick)
- [The Roll Model](#roll)
- [Feature Transformations](#features)
    - [Fractional Differentiation](#frac_diff)
    - [Wald-Wolfowitz Runs Randomness](#wald_wolf)
    - [Entropy Measures](#entropy)
- [The Feature Matrix](#matrix)

Before starting, we must first import our tick data from the sample data folder.

In [15]:
# Importing packages
import pandas as pd

from mlfinlab.microstructural_features import first_generation as first_gen 
from mlfinlab.microstructural_features import entropy 

In [16]:
# Reading in the tick data and only storing the closing price

url = "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/tick_bars.csv"
tick_prices = pd.read_csv(url, index_col=0)['close']

# Previewing the data
tick_prices.head()

date_time
2015-01-01 23:16:58.834    2058.75
2015-01-02 00:36:33.094    2058.25
2015-01-02 01:54:08.770    2061.00
2015-01-02 04:28:09.015    2062.50
2015-01-02 06:57:53.850    2063.00
Name: close, dtype: float64

## The Tick Rule <a class="anchor" id="tick"></a>

The Tick Rule is an algorithm used to determine a trade’s aggressor side. A buy-initiated trade is labeled 1 and a sell-initiated trade is labeled -1.

In [17]:
# Generating trade classifications for our tick data
tick_classifications = first_gen.tick_rule(prices = tick_prices)

# Previewing the tick classifications
tick_classifications.head()

date_time
2015-01-01 23:16:58.834    1.0
2015-01-02 00:36:33.094   -1.0
2015-01-02 01:54:08.770    1.0
2015-01-02 04:28:09.015    1.0
2015-01-02 06:57:53.850    1.0
dtype: float64
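
For intuition, the rule itself can be sketched in a few lines of plain Python. This is a minimal illustration rather than the `first_gen.tick_rule` implementation: the helper name `tick_rule_sketch` and the choice to label the first trade as a buy (`initial_side=1.0`) are assumptions made here.

```python
def tick_rule_sketch(prices, initial_side=1.0):
    """Classify each trade as buy (1.0) or sell (-1.0) from price changes."""
    sides = []
    # Assumption: the side used until the first non-zero price change is seen.
    previous_side = initial_side
    for position, price in enumerate(prices):
        if position > 0:
            change = price - prices[position - 1]
            if change > 0:
                previous_side = 1.0   # uptick: buy-initiated
            elif change < 0:
                previous_side = -1.0  # downtick: sell-initiated
            # Zero change: keep the previous classification.
        sides.append(previous_side)
    return sides


example_prices = [2058.75, 2058.25, 2061.00, 2061.00, 2060.50]
example_sides = tick_rule_sketch(example_prices)
print(example_sides)  # [1.0, -1.0, 1.0, 1.0, -1.0]
```

Note how the fourth trade, whose price is unchanged, inherits the classification of the trade before it.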

## The Roll Model <a class="anchor" id="roll"></a>

The Roll Model (1984) is a market microstructure model that estimates the effective bid-ask spread of a security from observed transaction prices alone. It does not use any information on the underlying bid-ask quotes or order flow.

In [18]:
# Calculating the Roll spread of the security based on the tick data
spread = first_gen.roll_spread(prices = tick_prices)

# Previewing the Roll spread
spread

0.36960456437467826
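
The estimator behind the model can be written down directly. Roll (1984) estimates the effective spread as $2\sqrt{-\mathrm{cov}(\Delta p_t, \Delta p_{t-1})}$, and the sketch below implements that formula in plain Python. The helper name `roll_spread_sketch`, the use of the population covariance, and the convention of returning 0 when the covariance is positive are assumptions of this sketch, so its output need not match `first_gen.roll_spread`.

```python
import math


def roll_spread_sketch(prices):
    """Estimate the effective spread from the serial covariance of price changes."""
    changes = [later - earlier for earlier, later in zip(prices, prices[1:])]
    current, lagged = changes[1:], changes[:-1]
    if len(current) < 2:
        raise ValueError("need at least four prices")
    mean_current = sum(current) / len(current)
    mean_lagged = sum(lagged) / len(lagged)
    # Population covariance between each price change and the one before it.
    covariance = sum(
        (c - mean_current) * (l - mean_lagged) for c, l in zip(current, lagged)
    ) / len(current)
    # Assumption: a positive covariance is treated as a zero spread.
    return 2.0 * math.sqrt(max(0.0, -covariance))


bouncing_prices = [100.0, 100.5, 100.0, 100.5, 100.0, 100.5]
trending_prices = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
bouncing_spread = roll_spread_sketch(bouncing_prices)
trending_spread = roll_spread_sketch(trending_prices)
```

Prices that bounce back and forth produce negatively correlated changes and therefore a positive spread estimate, while a steadily trending series produces positively correlated changes and an estimate of zero.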

## Feature Transformations <a class="anchor" id="features"></a>

Many transformations can be applied to the trade classifications yielded by the Tick Rule to produce interesting feature inputs for an ML model. The transformations contained in this module are: the Wald-Wolfowitz runs test, which measures how random the classification series is; fractional differencing of the classification series, which achieves stationarity while preserving a high degree of information; and various entropy measures, which quantify the amount of information contained in the classification sequence.

### Fractional Differentiation <a class="anchor" id="frac_diff"></a>

The key to successfully using the `fractional_differencing` function is to experiment with different differencing values and threshold values. If the result is not stationary enough (i.e. the p-value of the ADF test is too high), increase the differencing amount. If the result is not correlated enough with the original series (i.e. the Pearson r-value is too low), decrease the differencing amount.

In [19]:
# Generating the fractionally differenced series and its associated test statistics
fractionally_differenced_classifications = first_gen.fractional_differencing(
    classifications = tick_classifications,
    differencing_amount = 0.453,
    threshold = 0.01)

# Previewing the output
fractionally_differenced_classifications

{'series': date_time
 2015-01-01 23:16:58.834          NaN
 2015-01-02 00:36:33.094          NaN
 2015-01-02 01:54:08.770          NaN
 2015-01-02 04:28:09.015          NaN
 2015-01-02 06:57:53.850          NaN
                              ...    
 2016-12-30 20:58:37.916    86.773701
 2016-12-30 20:59:30.587    87.280432
 2016-12-30 20:59:53.515    85.685898
 2016-12-30 21:00:21.588    84.951707
 2016-12-30 21:08:09.245    84.418463
 Name: close, Length: 41123, dtype: float64,
 'adf_p_value': 0.05,
 'pearson_r_value': 0.999}

The `fractional_differencing` function returns a dictionary containing the differenced series, the p-value of an ADF test, and the r-value of a Pearson correlation test. Each can be accessed using the appropriate key.

In [20]:
# Previewing the ADF p-value
fractionally_differenced_classifications['adf_p_value']

0.05

In [21]:
# Previewing the Pearson r value
fractionally_differenced_classifications['pearson_r_value']

0.999

In [22]:
# Previewing the fractionally differenced series
fractionally_differenced_classifications['series']

date_time
2015-01-01 23:16:58.834          NaN
2015-01-02 00:36:33.094          NaN
2015-01-02 01:54:08.770          NaN
2015-01-02 04:28:09.015          NaN
2015-01-02 06:57:53.850          NaN
                             ...    
2016-12-30 20:58:37.916    86.773701
2016-12-30 20:59:30.587    87.280432
2016-12-30 20:59:53.515    85.685898
2016-12-30 21:00:21.588    84.951707
2016-12-30 21:08:09.245    84.418463
Name: close, Length: 41123, dtype: float64
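
To see what the differencing amount and threshold control, the sketch below computes fixed-width-window fractional differencing weights using the recursion $w_0 = 1$, $w_k = -w_{k-1}\,(d - k + 1)/k$, dropping weights once their absolute value falls below the threshold. The function names are hypothetical, and the exact windowing used inside `fractional_differencing` is an assumption here.

```python
def frac_diff_weights(d, threshold, max_lags=10_000):
    """Weights of the fractional difference operator, truncated at `threshold`."""
    weights = [1.0]
    for k in range(1, max_lags):
        next_weight = -weights[-1] * (d - k + 1) / k
        if abs(next_weight) < threshold:
            break
        weights.append(next_weight)
    return weights


def frac_diff_sketch(values, d, threshold):
    """Apply the truncated weights over a fixed-width rolling window."""
    weights = frac_diff_weights(d, threshold)
    width = len(weights)
    differenced = []
    for t in range(len(values)):
        if t < width - 1:
            differenced.append(None)  # not enough history yet
            continue
        window = values[t - width + 1 : t + 1]
        # weights[0] multiplies the most recent value, so reverse the window.
        differenced.append(sum(w * x for w, x in zip(weights, reversed(window))))
    return differenced


half_weights = frac_diff_weights(d=0.5, threshold=0.1)
first_difference = frac_diff_sketch([1.0, 3.0, 6.0, 10.0], d=1.0, threshold=0.01)
```

With `d=1.0` the weights collapse to `[1.0, -1.0]`, which is the ordinary first difference. A smaller `d` spreads slowly decaying weights over more lags, which is why it retains more memory of the original series, and a smaller threshold keeps more of those lags at the cost of more leading missing values.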

### Wald-Wolfowitz Runs Randomness <a class="anchor" id="wald_wolf"></a>

The `wald_wolfowitz_runs_test` function returns the p-value of a Wald-Wolfowitz runs test. The lower the p-value, the stronger the evidence against the hypothesis that the classifications are randomly ordered.

In [23]:
# Generating the p-value of the Wald-Wolfowitz runs test
wald_wolfowitz_randomness = first_gen.wald_wolfowitz_runs_test(classifications = tick_classifications)

# Previewing the p-value
wald_wolfowitz_randomness

7.494397640453383e-35
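
The test compares the observed number of runs (maximal streaks of identical classifications) with the number expected under random ordering. The sketch below uses the standard normal approximation with a two-sided p-value. The helper name `runs_test_p_value` is hypothetical, and whether `wald_wolfowitz_runs_test` uses the same approximation is an assumption.

```python
import math


def runs_test_p_value(classifications):
    """Two-sided p-value of the Wald-Wolfowitz runs test (normal approximation)."""
    n_pos = sum(1 for c in classifications if c > 0)
    n_neg = len(classifications) - n_pos
    n = n_pos + n_neg
    # A new run starts whenever the sign changes between neighbours.
    runs = 1 + sum(
        1 for a, b in zip(classifications, classifications[1:]) if (a > 0) != (b > 0)
    )
    expected_runs = 2 * n_pos * n_neg / n + 1
    variance = 2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n ** 2 * (n - 1))
    if variance <= 0:
        raise ValueError("need both classes and more than two observations")
    z = (runs - expected_runs) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


alternating = [1, -1] * 10
clustered = [1] * 10 + [-1] * 10
p_alternating = runs_test_p_value(alternating)
p_clustered = runs_test_p_value(clustered)
```

Both a strictly alternating sequence (too many runs) and a fully clustered one (too few runs) yield very small p-values, since each departs strongly from what random ordering would produce.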

### Entropy Measures <a class="anchor" id="entropy"></a>

Entropy measures in finance aim to quantify how much information is contained in a given time series. When markets are not perfect, prices are formed with partial information, so entropy measures are helpful in determining how much useful information those price signals contain.

#### Shannon Entropy

Claude Shannon is credited with one of the first conceptualizations of entropy in 1948, which he defined as the average amount of information produced by a stationary source of data. More formally, entropy is the smallest number of bits per character required to describe a message in a uniquely decodable way.

In [24]:
# Calculating the Shannon entropy of a subset of the tick classifications
shannon_entropy = entropy.get_shannon_entropy(message = tick_classifications[:100])

# Previewing the Shannon entropy
shannon_entropy

0.9895875212220555
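
As an illustration of the definition, Shannon entropy in bits is $-\sum_i p_i \log_2 p_i$, where $p_i$ is the empirical frequency of each distinct symbol. The helper below is a minimal sketch with a hypothetical name, not the `entropy.get_shannon_entropy` source.

```python
import math
from collections import Counter


def shannon_entropy_sketch(message):
    """Shannon entropy, in bits per symbol, of the symbols in `message`."""
    counts = Counter(message)
    total = len(message)
    return -sum(
        (count / total) * math.log2(count / total) for count in counts.values()
    )


balanced_entropy = shannon_entropy_sketch([1, -1, 1, -1])
constant_entropy = shannon_entropy_sketch([1, 1, 1, 1])
four_symbol_entropy = shannon_entropy_sketch("abcd")
```

A perfectly balanced binary message carries 1 bit per symbol, a constant message carries 0 bits, and four equally frequent symbols carry 2 bits.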

#### Plug-in Entropy

Gao et al. (2008) built on Shannon's work by conceptualizing the plug-in measure of entropy, also known as the maximum likelihood estimator of entropy. Given a data sequence $x_{1}^{n}$, comprising the string of values starting in position 1 and ending in position $n$, we can form a dictionary of all words of length $w < n$ in that sequence: $A^w$. The empirical frequency of each word is then used as its probability in the entropy formula, and the result is divided by $w$ to obtain an entropy rate.

In [25]:
# Calculating the Plug-in entropy of a subset of the tick classifications
plug_in_entropy = entropy.get_plug_in_entropy(message = tick_classifications[:100])

# Previewing the Plug-in entropy
plug_in_entropy

0.9875257101057102
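
A minimal sketch of the plug-in idea follows: count every overlapping word of length `word_length`, turn the counts into frequencies, apply the Shannon formula, and divide by the word length. The function name and the `word_length` parameter are hypothetical, and the default word length used by `get_plug_in_entropy` is not assumed here.

```python
import math
from collections import Counter


def plug_in_entropy_sketch(message, word_length=1):
    """Plug-in (maximum likelihood) entropy rate in bits per symbol."""
    if not 1 <= word_length <= len(message):
        raise ValueError("word_length must be between 1 and len(message)")
    # Overlapping words of the requested length; tuples work for strings and lists.
    words = [
        tuple(message[start : start + word_length])
        for start in range(len(message) - word_length + 1)
    ]
    counts = Counter(words)
    total = len(words)
    word_entropy = -sum(
        (count / total) * math.log2(count / total) for count in counts.values()
    )
    return word_entropy / word_length


single_symbol_rate = plug_in_entropy_sketch("0101", word_length=1)
pair_rate = plug_in_entropy_sketch("00011011", word_length=2)
```

With `word_length=1` this reduces to the Shannon entropy of the individual symbols. Longer words capture dependence between neighbouring classifications.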

#### Lempel-Ziv Entropy 

In a similar spirit to Shannon entropy, Abraham Lempel and Jacob Ziv proposed in 1978 that entropy be treated as a measure of complexity. Intuitively, a complex sequence contains more information than a regular (predictable) sequence. Building on this, the Lempel-Ziv (LZ) algorithm decomposes a message into a number of non-redundant substrings, and LZ entropy divides the number of non-redundant substrings by the length of the original message. The intuition is that complex messages have high entropy and therefore require large dictionaries of substrings relative to the length of the original message.

In [26]:
# Calculating the Lempel-Ziv entropy of a subset of the tick classifications
lempel_ziv_entropy = entropy.get_lempel_ziv_entropy(message = tick_classifications[:100])

# Previewing the Lempel-Ziv entropy
lempel_ziv_entropy

0.29
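
The decomposition can be sketched as follows: scan the message from left to right and, at each position, extend the current substring until it is one that has not been stored yet. This is an illustrative version with a hypothetical function name, and it assumes the classifications have already been encoded as a string (for example `"1"` for buys and `"0"` for sells).

```python
def lempel_ziv_entropy_sketch(message):
    """Number of non-redundant substrings divided by the message length."""
    if not message:
        raise ValueError("message must not be empty")
    dictionary = {message[0]}
    start = 1
    while start < len(message):
        end = start + 1
        # Grow the candidate substring until it is new or the message ends.
        while end <= len(message) and message[start:end] in dictionary:
            end += 1
        if end <= len(message):
            dictionary.add(message[start:end])
        start = end
    return len(dictionary) / len(message)


classifications = [1, -1, 1, -1]
encoded = "".join("1" if c > 0 else "0" for c in classifications)
alternating_lz = lempel_ziv_entropy_sketch(encoded)
constant_lz = lempel_ziv_entropy_sketch("0000")
```

A constant message needs few distinct substrings relative to its length, so its LZ entropy is lower than that of a message that keeps introducing new patterns.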

#### Kontoyiannis Entropy

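In 1998, Kontoyiannis attempted to make more efficient use of the information available in a message by taking advantage of a technique known as length matching, which measures how long a substring starting at each position can be matched against the preceding part of the message.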

In [27]:
# Calculating the Kontoyiannis entropy of a subset of the tick classifications
kontoyiannis_entropy = entropy.get_konto_entropy(message = tick_classifications[:100])

# Previewing the Kontoyiannis entropy
kontoyiannis_entropy

0.8681763656863827
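
The match-length idea can be sketched as below. For each position `i`, we find the longest substring starting at `i` that also starts somewhere in the preceding `i` characters, add one, and average $\log_2(i + 1)$ divided by that match length over the first half of the message. This expanding-window formulation and both function names are assumptions of this sketch and may differ from what `get_konto_entropy` does internally.

```python
import math


def match_length(message, i, n):
    """1 + length of the longest substring at `i` that also starts in the n positions before `i`."""
    longest = 0
    for length in range(1, n + 1):
        target = message[i : i + length]
        if any(message[j : j + length] == target for j in range(i - n, i)):
            longest = length
        else:
            break  # a longer substring cannot match if this one does not
    return longest + 1


def kontoyiannis_entropy_sketch(message):
    """Average of log2(i + 1) / match length over the first half of the message."""
    if len(message) < 2:
        raise ValueError("message must contain at least two symbols")
    points = range(1, len(message) // 2 + 1)
    total = sum(math.log2(i + 1) / match_length(message, i, i) for i in points)
    return total / len(points)


constant_estimate = kontoyiannis_entropy_sketch("00000000")
mixed_estimate = kontoyiannis_entropy_sketch("01101001")
```

Long matches signal a predictable message and pull the estimate down, so the constant message receives a lower estimate than the mixed one.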

Users may wonder why some of the entropy functions can yield values greater than 1. A comprehensive explanation can be found in the StackExchange thread [Why am I getting information entropy greater than 1?](https://stats.stackexchange.com/questions/95261/why-am-i-getting-information-entropy-greater-than-1).

The short explanation is that we calculate our entropy measures using a base-2 logarithm, so entropy is expressed in bits. For a message with $k$ distinct symbols the maximum Shannon entropy is $\log_2 k$, which exceeds 1 whenever $k > 2$.

## The Feature Matrix <a class="anchor" id="matrix"></a>

Finally, we show the easiest way of taking advantage of the functions in this module, which is to call the `generate_feature_matrix` function.

The user must provide the tick prices and a `lookback_period`, which is the number of entries over which each feature transformation in this module is computed on a rolling basis, along with the fractional differencing amount and threshold.

In [28]:
# Generating our feature matrix
feature_matrix = first_gen.generate_feature_matrix(
    tick_prices = tick_prices,
    lookback_period = 5,
    fractional_differencing_amount = 0.453,
    fractional_differencing_threshold = 0.01)

# Previewing the feature matrix
feature_matrix

| date_time | roll_spread | fractional_difference | wald_wolfowitz_p_value | shannon_entropy | lempel_ziv_entropy | plug_in_entropy | konto_entropy |
|---|---|---|---|---|---|---|---|
| 2015-01-02 09:21:26.264 | 1.802776 | 1.023294 | 0.662521 | 0.970951 | 0.6 | 0.811278 | 0.764160 |
| 2015-01-02 09:53:13.935 | 0.000000 | -0.285212 | 0.414216 | 0.721928 | 0.6 | 0.811278 | 0.646241 |
| 2015-01-02 10:20:26.945 | 2.061553 | -0.898121 | 0.414216 | 0.721928 | 0.6 | 0.811278 | 1.000000 |
| 2015-01-02 11:05:56.143 | 1.322876 | 0.652724 | 0.512691 | 0.970951 | 0.6 | 0.811278 | 0.896241 |
| 2015-01-02 11:45:14.081 | 3.763863 | -0.621354 | 0.512691 | 0.970951 | 0.6 | 1.000000 | 0.896241 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2016-12-30 20:58:37.916 | 0.000000 | 86.773701 | 0.662521 | 0.970951 | 0.6 | 1.000000 | 0.896241 |
| 2016-12-30 20:59:30.587 | 1.224745 | 87.280432 | 0.126630 | 0.970951 | 0.6 | 1.000000 | 1.000000 |
| 2016-12-30 20:59:53.515 | 1.040833 | 85.685898 | 0.662521 | 0.970951 | 0.6 | 0.811278 | 0.764160 |
| 2016-12-30 21:00:21.588 | 0.000000 | 84.951707 | 0.126630 | 0.970951 | 0.6 | 0.811278 | 0.646241 |
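
To make the rolling computation concrete, the sketch below applies an arbitrary feature function to each trailing window of `lookback_period` entries. The helper names `rolling_apply` and `count_runs` are hypothetical, and marking the first incomplete windows as `None` is a choice made for this illustration. With pandas, `Series.rolling(window).apply(func)` offers a comparable mechanism.

```python
def rolling_apply(values, lookback_period, feature):
    """Evaluate `feature` on each trailing window of `lookback_period` values."""
    if lookback_period < 1:
        raise ValueError("lookback_period must be at least 1")
    results = []
    for end in range(len(values)):
        if end + 1 < lookback_period:
            results.append(None)  # window not yet full
        else:
            window = values[end + 1 - lookback_period : end + 1]
            results.append(feature(window))
    return results


def count_runs(window):
    """Number of streaks of identical classifications in the window."""
    return 1 + sum(1 for a, b in zip(window, window[1:]) if a != b)


classifications = [1, 1, -1, -1, -1, 1, -1]
rolling_runs = rolling_apply(classifications, lookback_period=3, feature=count_runs)
print(rolling_runs)  # [None, None, 2, 2, 1, 2, 3]
```

Each column of a feature matrix can be viewed as one such feature function evaluated over the same sliding window.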


## Conclusion

This notebook describes functions belonging to the First Generation Models of the Microstructural Features Module from the MlFinLab package.

These tools were originally presented in the book "Advances in Financial Machine Learning" by Marcos Lopez de Prado (https://www.wiley.com/en-us/Advances+in+Financial+Machine+Learning-p-9781119482086).

Key takeaways from the notebook:

* The Tick Rule is an algorithm used to determine a trade’s aggressor side.

* The Roll Model provides a market microstructure model for estimating the effective bid-ask spread of a security.

* The Wald-Wolfowitz Runs test is used to determine how random the classifications are.

* Fractional differencing of the classification series is used to achieve stationarity while simultaneously preserving a high degree of information.

* Various entropy measures determine the amount of information contained in the classification sequence.

## Reference

* Lopez de Prado, M. (2018) Advances in Financial Machine Learning. New York, NY: John Wiley & Sons.