# Replicating Transfer Entropy Code in Python

Anderson and McMullin, "Detecting Information Flows in Markets" (2018) **[1]**, compute Transfer Entropy using the estimator from Kraskov, A., H. Stogbauer, and P. Grassberger (2004, Jun). Estimating mutual information. Phys. Rev. E 69, 066138 **[2]**.

McMullin referred us to the implementation of **[2]** that they use, which is here: https://github.com/Healthcast/TransEnt/blob/master/src/compute_TE.cpp. McMullin recommended that we use the MI_diff function in the compute_TE.cpp file because it is better empirically and theoretically.

The rest of this notebook implements compute_TE.cpp using the sample data that McMullin sent us which can be found here: https://uofi.box.com/s/ror93i0ed7ky4hzn38ljyc22i5o0p4zd. 
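
For orientation, here is a minimal sketch of the first estimator described in **[2]**, written with brute-force NumPy distances. It is not a port of compute_TE.cpp or of its MI_diff function, and it estimates plain mutual information between two 1-D samples rather than transfer entropy. The function name `ksg_mutual_information` and the default `k=4` are my own choices for illustration.

```python
import numpy as np


def ksg_mutual_information(x, y, k=4):
    """Brute-force sketch of KSG algorithm 1: estimate I(X;Y) in nats."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size

    # Pairwise distances in each marginal space and, via the max norm, in the joint space
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    dz = np.maximum(dx, dy)
    np.fill_diagonal(dz, np.inf)  # a point is not its own neighbour

    # Distance from each point to its k-th nearest neighbour in the joint space
    eps = np.partition(dz, k - 1, axis=1)[:, k - 1]
    if np.any(eps <= 0):
        # Tied points give a zero neighbour distance, which the estimator cannot handle
        raise ValueError("Duplicate points found; add a little noise first.")

    # Marginal neighbours strictly closer than eps (minus 1 for the point itself)
    n_x = (dx < eps[:, None]).sum(axis=1) - 1
    n_y = (dy < eps[:, None]).sum(axis=1) - 1

    # Digamma at positive integers: psi(m) = -gamma + sum_{j=1}^{m-1} 1/j
    harmonic = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, n))))
    psi = harmonic - np.euler_gamma  # psi[m - 1] == digamma(m)

    return psi[k - 1] + psi[n - 1] - np.mean(psi[n_x] + psi[n_y])


# Synthetic check on Gaussian data (not the sample data)
rng = np.random.default_rng(0)
a = rng.standard_normal(1000)
b = rng.standard_normal(1000)
c = 0.9 * a + np.sqrt(1 - 0.9 ** 2) * b  # correlation 0.9 with a

mi_independent = ksg_mutual_information(a, b)  # analytic value: 0
mi_correlated = ksg_mutual_information(a, c)   # analytic value: -0.5 * ln(1 - 0.81), about 0.83
print(mi_independent, mi_correlated)
```

The brute-force distance matrices need O(N^2) memory, so this is only practical for a few thousand points; a tree structure such as `KDTree` is the usual replacement for larger samples.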

----
**Note:** To gain access to the sample data contact either: ikegwu2 [at] illinois or bigdog [at] illinois

----
The code cell below imports libraries needed for this notebook:

In [1]:
# Standard Imports

from IPython.display import display

import numpy as np  # to handle data
import pandas as pd  # to handle data
import os  # to handle manipulation of files
from sklearn.neighbors import KDTree  # to compute distance

from numpy.testing import assert_array_equal  # To check certain operations

The code cell below obtains the file names from the sample data and prints the first 9 files.

In [2]:
# Directory where data is stored
dir_path = '/Users/jarvis/Downloads/spy_09_21_18_updated/'  # Yes my computer's name is jarvis

files = os.listdir(dir_path)  # Grab files in directory
print("Files:", sorted(files)[0:9])  # display first 9 files
transfer_entropy = {}  # contains values for transfer entropy

Files: ['SPY_US_09_21_18_C100_Equity_ask.dat', 'SPY_US_09_21_18_C100_Equity_bid.dat', 'SPY_US_09_21_18_C100_Equity_last.dat', 'SPY_US_09_21_18_C105_Equity_ask.dat', 'SPY_US_09_21_18_C105_Equity_bid.dat', 'SPY_US_09_21_18_C105_Equity_last.dat', 'SPY_US_09_21_18_C115_Equity_ask.dat', 'SPY_US_09_21_18_C115_Equity_bid.dat', 'SPY_US_09_21_18_C115_Equity_last.dat']


### Describing the File Names & Data: 


1. The file "SPY_US_09_21_18_C100_Equity_ask.dat" contains the prices for the SPY security and the call option with a strike price of $100. The date is 09.21.2018. The prices are recorded from Bloomberg's ask variable and are sampled every two seconds. The file has 2 columns: the first column is the price of the security and the second column is the price of the option.

2. The file "SPY_US_09_21_18_C100_Equity_bid.dat" contains the same information as the first item, except that the prices are recorded from Bloomberg's bid variable.

3. The file "SPY_US_09_21_18_C100_Equity_last.dat" contains the same information as the first & second items, except that the prices are recorded from Bloomberg's last variable.

Generally there are 3 files for each strike price.
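
The naming scheme described above can be sketched as a small parser. The regular expression is inferred only from the file names listed earlier, so treat it as an assumption: in particular, the `P` prefix for put options is a guess (only `C` appears in the sample), and `parse_filename` is a hypothetical helper, not something from compute_TE.cpp.

```python
import re

# Pattern inferred from the sample file names; the "P" option type is an assumption
FILENAME_RE = re.compile(
    r"^(?P<ticker>[A-Z]+)_(?P<market>[A-Z]+)_"
    r"(?P<month>\d{2})_(?P<day>\d{2})_(?P<year>\d{2})_"
    r"(?P<option_type>[CP])(?P<strike>\d+(?:\.\d+)?)_Equity_"
    r"(?P<quote>ask|bid|last)\.dat$"
)


def parse_filename(name):
    """Split a sample-data file name into its labelled parts."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"Unrecognised file name: {name}")
    parts = match.groupdict()
    parts["strike"] = float(parts["strike"])  # strike price as a number
    return parts


info = parse_filename("SPY_US_09_21_18_C100_Equity_ask.dat")
print(info["ticker"], info["option_type"], info["strike"], info["quote"])
```

Grouping the parsed results by `strike` would make it easy to confirm that each strike price really has its ask, bid and last files.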

-----

__Note:__ Ask Jeff about Bloomberg data, in particular for a reference that describes the data. I cannot find any official documentation, so I assume that bid is the max price that a buyer is willing to pay for a security, ask is the min price that a seller is willing to receive for selling a security, and last is simply the price at which the last trade occurred.

-----

We'll work with 1 file for now and eventually apply the concepts presented below to all of the data. The code cell below reads in the data for "SPY_US_09_21_18_C100_Equity_ask.dat":

In [3]:
# Reading in data
df_100 = pd.read_csv(dir_path + "SPY_US_09_21_18_C100_Equity_ask.dat",
                     delimiter=' ',  # The security price and option price are separated by a space
                     comment='#',  # Lines with missing data start with a #; these lines are ignored
                     names=['security price', 'option price'])

# Displaying first 10 rows of data
display(df_100.head(10))

| | security price | option price |
|---|---|---|
| 0 | 286.99 | 187.99 |
| 1 | 287.0 | 187.15 |
| 2 | 286.99 | 187.15 |
| 3 | 287.0 | 188.01 |
| 4 | 286.96 | 187.15 |
| 5 | 287.01 | 187.15 |
| 6 | 286.97 | 187.15 |
| 7 | 286.99 | 187.15 |
| 8 | 287.0 | 187.13 |
| 9 | 287.04 | 187.16 |


### Safety Check Function

Taken from: https://github.com/ikegwukc/TransEnt/blob/master/man/computeTE.Rd#L51
> A more subtle error can occur when multiple points in X^(k) (or some of its subspaces) have identical coordinates. This can lead to several points which have identical distance to a query point, which violates the assumptions of the Kraskov estimator, causing it to throw an error. The solution in this case is to add some small noise to the measurements 

In [compute_TE.cpp](https://github.com/ikegwukc/TransEnt/blob/master/src/compute_TE.cpp) there's a function called `safetyCheck` that checks for duplicate points in each column. If there are duplicate points, the program stops and recommends that you add noise to your data.

In the first 10 rows displayed above, rows 0 & 2 have duplicate points in the security price column, rows 1 & 2 have duplicate points in the option price column, and so on. Ergo, if this data is used with compute_TE.cpp it will break, and [compute_TE.cpp](https://github.com/ikegwukc/TransEnt/blob/master/src/compute_TE.cpp) will recommend that you use the `safetyCheck` function with your data, which indicates that you need to add some noise to your data.
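
To make the duplicate problem concrete, here is a minimal sketch of a per-column duplicate check. It is not a port of `safetyCheck`; the helper name `find_duplicate_columns` is hypothetical, and it only reports how many repeated values each column contains instead of stopping the program.

```python
import numpy as np


def find_duplicate_columns(data, names):
    """Return {column name: number of repeated values} for columns that contain ties."""
    data = np.asarray(data, dtype=float)
    report = {}
    for j, name in enumerate(names):
        n_unique = np.unique(data[:, j]).size
        n_repeats = data.shape[0] - n_unique  # rows beyond the first occurrence of each value
        if n_repeats > 0:
            report[name] = n_repeats
    return report


# The first 10 rows of the ask file, as displayed above
rows = [
    (286.99, 187.99), (287.00, 187.15), (286.99, 187.15), (287.00, 188.01),
    (286.96, 187.15), (287.01, 187.15), (286.97, 187.15), (286.99, 187.15),
    (287.00, 187.13), (287.04, 187.16),
]
report = find_duplicate_columns(rows, ["security price", "option price"])
print(report)  # {'security price': 4, 'option price': 5}
```

Even in these 10 rows, close to half of the values in each column are repeats, which is why the estimator cannot be run on the raw prices.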

In the code cell below I find all duplicate points in each column and add a small amount of noise to the duplicate points.

-----
**Note:**
- Ask about how McMullin handled this in R. 
    - Do they add a bit of noise to all points? Or just to duplicates?
    - How do they handle duplicates?
        - After first occurrence or last occurrence?
    - In Section 3.2 of [1] they use noise from normal distribution for an example. For now I assume they use the standard normal distribution to generate noise with actual data.
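
One of the open questions above is whether noise should go on every point or only on duplicates. As a point of comparison, here is a sketch of the "all points" option with a much smaller noise scale. The helper `add_jitter` and the scale `1e-6` are my own illustrative assumptions, not values taken from [1] or compute_TE.cpp. The motivation is that the prices shown earlier move by only a few cents between samples, so unit-variance noise is large relative to those moves.

```python
import numpy as np


def add_jitter(values, scale=1e-6, seed=23):
    """Return a copy of `values` with N(0, scale^2) noise added to every entry."""
    rng = np.random.RandomState(seed)
    values = np.asarray(values, dtype=float)
    return values + scale * rng.randn(*values.shape)


prices = np.array([286.99, 287.00, 286.99, 287.00, 286.96])
jittered = add_jitter(prices)

# Ties are broken, while every price stays well within one cent of its original value
print(np.unique(jittered).size == prices.size)
print(np.abs(jittered - prices).max() < 0.01)
```

Whether this matches what McMullin does in R is still unknown, so the cell below keeps the duplicates-only approach with standard normal noise.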

In [4]:
# Random number generator with a constant seed for reproducibility
random_state = np.random.RandomState(seed=23)

# Making a copy of the original data since we will modify it below
df_100_bak = df_100.copy()

# Find duplicates for security price column.
dups = df_100.duplicated(subset="security price", keep='first')

# Select indices where there are duplicates
idx = df_100.index[dups].tolist()

# Select all of the indices where there are duplicate points
# and add a random value to each duplicate point
# from the standard normal distribution.
# A single .loc[rows, column] call avoids chained assignment,
# which can end up modifying a temporary copy instead of df_100.
df_100.loc[idx, 'security price'] = np.add(df_100.loc[idx, 'security price'].values,
                                           random_state.randn(len(idx)))

# Next we will perform a check to make sure that
# only the values with duplicate points were changed

assert_array_equal(df_100['security price'] != df_100_bak['security price'], dups,
                  err_msg="In addition to duplicate points you modified points that were different.")

# We will now add noise to duplicate points for the option price column.

dups = df_100_bak.duplicated(subset="option price", keep='first')  # Find duplicates

idx = df_100_bak.index[dups].tolist()  # select those indices

# add random noise to those indices (single .loc call, no chained assignment)
df_100.loc[idx, 'option price'] = np.add(df_100.loc[idx, 'option price'].values,
                                         random_state.randn(len(idx)))

assert_array_equal(df_100['option price'] != df_100_bak['option price'], dups,
                  err_msg="In addition to duplicate points you modified points that were different.")

In [5]:
# Writing to csv to compute in R to compare to final value.
df_100.to_csv("./testData/df_100_nodup.csv", index=False, sep=" ")