# Entropy

## Introduction

> **Underlying Literature**: The following module was inspired by the ideas put forward in Chapter 18 of [Advances in Financial Machine Learning](https://www.wiley.com/en-us/Advances+in+Financial+Machine+Learning-p-9781119482086) by Marcos Lopez de Prado

Entropy measures in finance aim to quantify how much information a given time series contains. When markets are not perfect, prices are formed with partial information, so entropy measures help determine how much useful information those price signals actually carry.


### Table of Contents
- [Shannon Entropy](#shannon)
- [Plug-in Entropy](#plug)
- [Lempel-Ziv Entropy](#lz)
- [Kontoyiannis Entropy](#konto)
- [Binary Encoding](#binary)
- [Quantile Encoding](#quantile)
- [Sigma Encoding](#sigma)

Before starting, we must first import our tick data from the sample data folder and generate tick classifications from it.

In [3]:
# Importing packages
import pandas as pd

from mlfinlab.microstructural_features import (first_generation as first_gen,
                                               entropy)

In [4]:
# Reading in the tick data and only storing the closing price
url = "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/tick_bars.csv"
tick_prices = pd.read_csv(url, index_col=0)['close']

# Generating trade classifications for our tick data using the tick rule
tick_classifications = first_gen.tick_rule(prices = tick_prices)

# Previewing the tick classifications
tick_classifications.head()

date_time
2015-01-01 23:16:58.834    1.0
2015-01-02 00:36:33.094   -1.0
2015-01-02 01:54:08.770    1.0
2015-01-02 04:28:09.015    1.0
2015-01-02 06:57:53.850    1.0
dtype: float64
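For intuition, the tick rule labels each trade by the direction of the latest price change and carries the previous label forward when the price is unchanged. The following is a minimal pure-Python sketch of that idea, not the MlFinLab implementation; the name `tick_rule_sketch` and the `initial` label given to the first tick are assumptions made for illustration.

```python
def tick_rule_sketch(prices, initial=1.0):
    """Label ticks as 1.0 (uptick) or -1.0 (downtick).

    `initial` is an assumed label for the first tick, which has no
    previous price to compare against.
    """
    prices = list(prices)
    if not prices:
        return []
    labels = [initial]
    for previous_price, price in zip(prices, prices[1:]):
        if price > previous_price:
            labels.append(1.0)
        elif price < previous_price:
            labels.append(-1.0)
        else:
            # Unchanged price: reuse the previous label
            labels.append(labels[-1])
    return labels


print(tick_rule_sketch([10.0, 10.5, 10.5, 10.2, 10.2, 10.7]))
```

The library function may treat the first observation differently, so this sketch is only meant to show the rule itself.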

Now, we can apply our entropy measures to these tick classifications. For the purposes of this notebook, we'll be applying the functions to a subset of our tick classifications.

## Shannon Entropy <a class="anchor" id="shannon"></a>

Claude Shannon is credited with one of the first conceptualizations of entropy in 1948, which he defined as the
average amount of information produced by a stationary source of data. More precisely, entropy is the smallest
number of bits per character required to describe a message in a uniquely decodable way.

In [5]:
# Calculating the Shannon entropy of a subset of the tick classifications
shannon_entropy = entropy.get_shannon_entropy(message = tick_classifications[:100])

# Previewing the Shannon entropy
shannon_entropy

0.9895875212220555
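To make the definition concrete, here is a minimal sketch that computes Shannon entropy in bits from the empirical symbol frequencies of a message. It is written from the textbook formula $H = \sum_{x} p(x) \log_2 \frac{1}{p(x)}$ and is not the MlFinLab implementation; the name `shannon_entropy_sketch` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy_sketch(message):
    """Shannon entropy (bits per symbol) of the empirical symbol distribution."""
    symbols = list(message)
    if not symbols:
        raise ValueError("message must not be empty")
    counts = Counter(symbols)
    total = len(symbols)
    # sum of p * log2(1 / p) over the observed symbols
    return sum((c / total) * math.log2(total / c) for c in counts.values())


balanced = shannon_entropy_sketch([1.0, -1.0, 1.0, -1.0])  # two equally likely symbols
constant = shannon_entropy_sketch([1.0, 1.0, 1.0, 1.0])    # a single repeated symbol
print(balanced, constant)
```

A balanced two-symbol message reaches the two-symbol maximum of one bit, while a constant message carries no information, which is why the reading above sits just below 1 for a nearly balanced sequence of ticks.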

## Plug-in Entropy <a class="anchor" id="plug"></a>

Gao et al. (2008) built on the work done by Shannon by conceptualizing the Plug-in measure of entropy, also known as
the maximum likelihood estimator of entropy. Given a data sequence $x_{1}^{n}$, comprising the string of values starting in position 1 and ending in position
$n$, we can form a dictionary of all words of length $w < n$ in that sequence: $A^w$.

In [6]:
# Calculating the Plug-in entropy of a subset of the tick classifications
plug_in_entropy = entropy.get_plug_in_entropy(message = tick_classifications[:100])

# Previewing the Plug-in entropy
plug_in_entropy

0.9875257101057102
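The dictionary of words described above can be sketched as follows: count every overlapping word of length $w$, turn the counts into empirical probabilities, compute their entropy, and divide by $w$ to express the result per character. This is an illustrative sketch under those assumptions (including the choice to use all overlapping windows), not the library's code; `plug_in_entropy_sketch` and `word_length` are hypothetical names.

```python
import math
from collections import Counter


def plug_in_entropy_sketch(message, word_length=1):
    """Plug-in (maximum likelihood) entropy estimate in bits per character.

    Assumes each symbol of `message` is a single character once converted
    to a string, for example "0" and "1".
    """
    text = "".join(map(str, message))
    n_words = len(text) - word_length + 1
    if n_words <= 0:
        raise ValueError("word_length must not exceed the message length")
    # Empirical frequency of every overlapping word of the given length
    counts = Counter(text[i:i + word_length] for i in range(n_words))
    entropy_per_word = sum(
        (c / n_words) * math.log2(n_words / c) for c in counts.values()
    )
    return entropy_per_word / word_length


print(plug_in_entropy_sketch("0101", word_length=1))
print(plug_in_entropy_sketch("0000", word_length=2))
```

With a word length of 1 the estimate coincides with the empirical Shannon entropy of the individual symbols.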

## Lempel-Ziv Entropy <a class="anchor" id="lz"></a>

In 1978, Abraham Lempel and Jacob Ziv proposed that entropy be treated as a measure of
complexity. Intuitively, a complex sequence contains more information than a regular (predictable) sequence. Based on
this idea, the Lempel-Ziv (LZ) algorithm decomposes a message into a number of non-redundant substrings. LZ entropy
is then the number of non-redundant substrings divided by the length of the
original message. The intuition here is that complex messages have high entropy, which will require large dictionaries
of substrings relative to the length of the original message.

In [7]:
# Calculating the Lempel-Ziv entropy of a subset of the tick classifications
lempel_ziv_entropy = entropy.get_lempel_ziv_entropy(message = tick_classifications[:100])

# Previewing the Lempel-Ziv entropy
lempel_ziv_entropy

0.29
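The decomposition into non-redundant substrings can be sketched as a single pass that keeps extending the current substring until it is one the library has not seen yet. This is a minimal sketch in the spirit of the LZ idea described above, not the MlFinLab implementation; the function names are hypothetical.

```python
def lempel_ziv_library(message):
    """Split a message into substrings that have not been seen before."""
    text = "".join(map(str, message))
    if not text:
        raise ValueError("message must not be empty")
    library = [text[0]]
    i = 1
    while i < len(text):
        for j in range(i, len(text)):
            candidate = text[i:j + 1]
            if candidate not in library:
                library.append(candidate)
                break
        # Continue after the substring just examined; a redundant tail is dropped
        i = j + 1
    return library


def lempel_ziv_entropy_sketch(message):
    """Number of non-redundant substrings divided by the message length."""
    text = "".join(map(str, message))
    return len(lempel_ziv_library(text)) / len(text)


print(lempel_ziv_library("0101"), lempel_ziv_entropy_sketch("0101"))
print(lempel_ziv_library("0000"), lempel_ziv_entropy_sketch("0000"))
```

The alternating message needs three distinct substrings for four characters, while the constant message needs only two, so the more varied message gets the higher reading.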

## Kontoyiannis Entropy <a class="anchor" id="konto"></a>

In 1998, Kontoyiannis attempted to make more efficient use of the information available in a message by taking advantage of
a technique known as length matching.

In [8]:
# Calculating the Kontoyiannis entropy of a subset of the tick classifications
kontoyiannis_entropy = entropy.get_konto_entropy(message = tick_classifications[:100])

# Previewing the Konto entropy
kontoyiannis_entropy

0.8681763656863827
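As a rough illustration of length matching, the sketch below assumes an expanding-window form of the estimator: at each position $i$ it finds the longest substring starting at $i$ that also starts somewhere in the preceding $i$ characters, adds 1 to get $L_i$, and averages $\log_2(i+1) / L_i$ over the first half of the message. Both that form and the names `match_length` and `konto_entropy_sketch` are assumptions for illustration; the library's implementation and options may differ.

```python
import math


def match_length(text, i, n):
    """Longest match of text[i:] found in the n positions before i, plus one."""
    longest = 0
    for length in range(1, n + 1):
        target = text[i:i + length]
        if len(target) < length:
            break  # ran off the end of the message
        # Candidate matches start before i and are allowed to overlap position i
        if any(text[j:j + length] == target for j in range(i - n, i)):
            longest = length
        else:
            break  # a longer prefix cannot match if this one does not
    return longest + 1


def konto_entropy_sketch(message):
    """Average of log2(i + 1) / L_i over an expanding lookback window."""
    text = "".join(map(str, message))
    points = range(1, len(text) // 2 + 1)
    if not points:
        raise ValueError("message must contain at least two characters")
    total = sum(math.log2(i + 1) / match_length(text, i, i) for i in points)
    return total / len(points)


print(konto_entropy_sketch("0101"), konto_entropy_sketch("0000"))
```

Long matches mean the recent past predicts what comes next, so larger $L_i$ values pull the estimate down.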

Users may wonder why entropy functions can yield values greater than 1. A comprehensive explanation can be found in the StackExchange thread [Why am I getting information entropy greater than 1?](https://stats.stackexchange.com/questions/95261/why-am-i-getting-information-entropy-greater-than-1).

The short explanation is that we calculate our entropy measures using a logarithm with base 2, so entropy is expressed in bits and can reach $\log_2$ of the number of distinct symbols in the message, which is greater than 1 whenever the message uses more than two symbols.
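The effect of the base-2 logarithm can be checked with a short sketch: a message that uses $k$ symbols equally often has an empirical entropy of $\log_2 k$ bits, which exceeds 1 as soon as $k > 2$. The helper `entropy_bits` is a hypothetical name used only for this illustration.

```python
import math
from collections import Counter


def entropy_bits(message):
    """Empirical entropy of a message, in bits per symbol."""
    counts = Counter(message)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


for alphabet in ("01", "0123", "01234567"):
    message = alphabet * 10  # every symbol appears equally often
    print(len(alphabet), entropy_bits(message), math.log2(len(alphabet)))
```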

Encoded messages can also be used to calculate entropy.

Before we start, we need to import the packages used for encoding.

In [9]:
# Importing packages
import pandas as pd
import numpy as np

## Binary Encoding <a class="anchor" id="binary"></a>

In the case of returns series derived from price bars (i.e., bars containing prices fluctuating between two symmetric
horizontal barriers, centered around the start price), binary encoding occurs naturally because the value of $|r_{t}|$
is roughly constant. For example, a stream of returns $r_{t}$ can be encoded according to the sign,
with 1 indicating $r_{t} > 0$ and 0 indicating $r_{t} < 0$, thus eliminating occurrences where $r_{t} = 0$.

In [10]:
# Import MlFinLab tools
from mlfinlab.microstructural_features.encoding import encode_tick_rule_array

# Create data and use tools
values = tick_classifications[:100].values
encoded_tick_rule = encode_tick_rule_array(values)

# Plug-in Entropy 
plug_in_entropy_binary_message = entropy.get_plug_in_entropy(message = encoded_tick_rule)
print('Plug-in Entropy', plug_in_entropy_binary_message)

Plug-in Entropy 0.9875257101057102
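The sign-based encoding described above can be sketched in a few lines: positive returns become "1", negative returns become "0", and zero returns are dropped. This is an illustrative sketch that assumes the input contains no missing values; `encode_by_sign` is a hypothetical name and not part of MlFinLab.

```python
def encode_by_sign(returns):
    """Encode returns as a binary string by sign, skipping zero returns."""
    return "".join("1" if r > 0 else "0" for r in returns if r != 0)


print(encode_by_sign([0.01, -0.02, 0.0, 0.03, -0.01]))
```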


## Quantile Encoding <a class="anchor" id="quantile"></a>

Unless price bars are employed, more than two codes are likely to be required. One method is to assign a code to each
$r_{t}$ based on the quantile to which it belongs. The quantile limits are calculated on an in-sample period (the training set).
Over the whole in-sample period, each letter is assigned the same number of observations, and out-of-sample, each letter is
assigned close to the same number of observations. Under this approach, some codes span a larger portion of $r_{t}$'s range than others.
This uniform (in-sample) or nearly uniform (out-of-sample) distribution of codes increases average entropy readings.

In [11]:
# Import MlFinLab tools
from mlfinlab.microstructural_features.encoding import (quantile_mapping, encode_array)

# Create data and use tools
values = tick_classifications[:100].values
quantile_dict = quantile_mapping(values, num_letters=10)
message = encode_array(values, quantile_dict)

# Plug-in Entropy 
plug_in_entropy_quantile_message = entropy.get_plug_in_entropy(message = message)
print('Plug-in Entropy', plug_in_entropy_quantile_message)

Plug-in Entropy 0.9875257101057102
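The quantile approach can be sketched with the standard library: estimate the quantile limits on an in-sample set, then assign each value the index of the quantile bucket it falls into. This is a minimal sketch under the assumption that integer codes stand in for letters; `quantile_encode_sketch` is a hypothetical name, and the exact quantile interpolation used by MlFinLab may differ.

```python
import bisect
import statistics
from collections import Counter


def quantile_encode_sketch(in_sample, values, num_letters=4):
    """Assign each value the index of its in-sample quantile bucket."""
    # num_letters - 1 cut points estimated on the in-sample data only
    limits = statistics.quantiles(in_sample, n=num_letters)
    return [bisect.bisect_right(limits, v) for v in values]


in_sample = list(range(1, 101))
codes = quantile_encode_sketch(in_sample, in_sample, num_letters=4)
print(Counter(codes))  # in-sample codes are evenly populated

# Out-of-sample values beyond the in-sample range fall into the end buckets
print(quantile_encode_sketch(in_sample, [-5, 1000], num_letters=4))
```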


## Sigma Encoding <a class="anchor" id="sigma"></a>

Rather than limiting the number of codes, we may instead let the price stream define the actual vocabulary.
Let's say we want to fix a discretization step, ${\sigma}$.
Then we assign 0 to $r_{t} \in [\min \{r\}, \min \{r\}+\sigma)$, 1 to $r_{t} \in [\min \{r\}+\sigma, \min \{r\}+2\sigma)$,
and so on until every observation has been encoded with a total of
$\text{ceil}\left[\frac{\max \{r\}-\min \{r\}}{\sigma}\right]$ codes,
where ceil[.] denotes the ceiling function.
Unlike quantile encoding, each code now covers the same fraction of the range of $r_{t}$'s.
Entropy readings will be smaller than in quantile encoding on average because codes are not uniformly distributed;
however, the introduction of a "rare" code will cause entropy measurements to surge.

In [12]:
# Import MlFinLab tools
from mlfinlab.microstructural_features.encoding import (sigma_mapping, encode_array)

# Create data and use tools
values = tick_classifications[:100].values
sigma_dict = sigma_mapping(values, step=1) 
message = encode_array(values, sigma_dict)

# Plug-in Entropy 
plug_in_entropy_sigma_message = entropy.get_plug_in_entropy(message = message)
print('Plug-in Entropy', plug_in_entropy_sigma_message)

Plug-in Entropy 0.9875257101057102
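The fixed-step scheme above amounts to measuring how many steps of size $\sigma$ each observation lies above the minimum. The sketch below is an illustration under that reading, using integer codes in place of letters; `sigma_encode_sketch` is a hypothetical name and not the MlFinLab function.

```python
import math


def sigma_encode_sketch(returns, sigma):
    """Code each return by the number of whole sigma steps above the minimum."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    returns = list(returns)
    lowest = min(returns)
    # With half-open intervals, a maximum that lies an exact multiple of
    # sigma above the minimum starts one further code of its own
    return [math.floor((r - lowest) / sigma) for r in returns]


print(sigma_encode_sketch([0.0, 0.4, 1.0, 2.5], sigma=1.0))
```

Because every code covers the same width, densely populated parts of the range share a code while outlying values receive rare codes of their own.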


## Conclusion

This notebook describes different entropy measures and message encoding methods implemented in the MlFinLab package.

These estimators were originally presented in the book "Advances in Financial Machine Learning" by Marcos Lopez de Prado (https://www.wiley.com/en-us/Advances+in+Financial+Machine+Learning-p-9781119482086).

Key takeaways from the notebook:

* Entropy measures help determine how much useful information is contained in price signals.

* Entropy output can be greater than 1 in some cases.

* Binary, Quantile and Sigma encoded messages can be used to calculate these entropy measures.

* Quantile and Sigma encoding need an additional function to build an encoding dictionary before a message can be encoded and used to calculate the entropy measures.

## Reference

* Lopez de Prado, M. (2018) Advances in Financial Machine Learning. New York, NY: John Wiley & Sons.