<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#High-Frequency-Trading-&amp;-2010-Market-Flash-Crash" data-toc-modified-id="High-Frequency-Trading-&amp;-2010-Market-Flash-Crash-1">High Frequency Trading &amp; 2010 Market Flash Crash</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1">Introduction</a></span></li><li><span><a href="#Methodology" data-toc-modified-id="Methodology-1.2">Methodology</a></span></li><li><span><a href="#Packages-and-Functions" data-toc-modified-id="Packages-and-Functions-1.3">Packages and Functions</a></span></li><li><span><a href="#Data-Size-and-Exploration" data-toc-modified-id="Data-Size-and-Exploration-1.4">Data Size and Exploration</a></span></li></ul></li><li><span><a href="#OpenMP-Approach-on-Stampede2" data-toc-modified-id="OpenMP-Approach-on-Stampede2-2">OpenMP Approach on Stampede2</a></span><ul class="toc-item"><li><span><a href="#Data-Processing" data-toc-modified-id="Data-Processing-2.1">Data Processing</a></span></li><li><span><a href="#Plots-and-Metrics" data-toc-modified-id="Plots-and-Metrics-2.2">Plots and Metrics</a></span></li><li><span><a href="#Daily-Messages-Volume" data-toc-modified-id="Daily-Messages-Volume-2.3">Daily Messages Volume</a></span></li></ul></li><li><span><a href="#Spark/Hadoop-Approach-on-Wrangler" data-toc-modified-id="Spark/Hadoop-Approach-on-Wrangler-3">Spark/Hadoop Approach on Wrangler</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Notebook-Spark-Setup" data-toc-modified-id="Notebook-Spark-Setup-3.0.1">Notebook Spark Setup</a></span></li></ul></li></ul></li></ul></div>

# High Frequency Trading & 2010 Market Flash Crash

**Jose Luis Rodriguez**

**COMP 488 Machine Learning** 

**Loyola University Chicago**

## Introduction 

As computational capabilities continue to grow and financial markets around the world become increasingly dependent on automated systems, it is worth analyzing the market depth data for the weeks leading up to, and the day of, May 6, 2010. The goal is to better understand market behavior during this period of high volatility and to compare this project's findings with the official CFTC and SEC report to Congress. The data used in this project comes from CME Group and consists of market depth transactions, timestamped to the millisecond, for the E-Mini S\&P 500 futures and options contracts.

## Methodology

The market depth data provided by CME Group contains all the market data messages required to recreate the order book (the list of orders that a trading venue uses to record the interest of buyers and sellers in a particular financial instrument). Each message contains between five and ten orders deep in futures markets and three orders deep in options markets. The data is timestamped to the millisecond, allowing for an in-depth analysis of price movements.
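
To make the idea of recreating an order book from update messages concrete, here is a minimal sketch. It assumes a simplified, hypothetical update layout (`side`, `price`, `quantity`) rather than the actual CME message format, and the class name `OrderBook` is illustrative only.

```python
# Hypothetical, simplified book updates; the field names are illustrative
# and do not follow the CME message specification.

class OrderBook:
    def __init__(self, depth=10):
        self.depth = depth   # number of price levels reported per side
        self.bids = {}       # price -> quantity
        self.asks = {}       # price -> quantity

    def apply(self, side, price, quantity):
        """Set the quantity at a price level; a quantity of 0 removes the level."""
        levels = self.bids if side == "bid" else self.asks
        if quantity == 0:
            levels.pop(price, None)
        else:
            levels[price] = quantity

    def top(self, side):
        """Return up to `depth` (price, quantity) levels, best price first."""
        if side == "bid":
            ordered = sorted(self.bids.items(), reverse=True)  # highest bid first
        else:
            ordered = sorted(self.asks.items())                # lowest ask first
        return ordered[:self.depth]


book = OrderBook(depth=5)
book.apply("bid", 1165.25, 40)
book.apply("bid", 1165.00, 75)
book.apply("ask", 1165.50, 30)
book.apply("bid", 1165.25, 0)   # the 1165.25 bid level is removed
best_bid = book.top("bid")[0]
best_ask = book.top("ask")[0]
```

Replaying every message in timestamp order through `apply` would yield the state of the book at any point in the day, under the assumptions stated above.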

The volume of transactions is large (millions per week), and the goal is to compute message volume per day, hour, minute, second, and millisecond, along with other data metrics. This requires running statistical operations (such as distributions and averages) and filtering in parallel. Map-reduce-type jobs are ideal for this kind of task, since the nature of the data (independent transactions) allows for parallel processing in most cases.
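
The map-reduce idea can be sketched in plain Python. The example below assumes timestamps are strings in `YYYYMMDDHHMMSSmmm` form, which is an assumption made for illustration and not a statement about the CME format; the helper names `bucket` and `volume` are hypothetical.

```python
from collections import Counter

# Hypothetical timestamps in YYYYMMDDHHMMSSmmm form (millisecond resolution).
timestamps = [
    "20100506144500123",
    "20100506144500987",
    "20100506144501005",
    "20100506144612340",
]

# Number of leading characters that identify each time resolution.
WIDTHS = {"day": 8, "hour": 10, "minute": 12, "second": 14, "millisecond": 17}

def bucket(ts, resolution):
    """Map step: truncate a timestamp to the requested resolution."""
    return ts[:WIDTHS[resolution]]

def volume(stamps, resolution):
    """Reduce step: count messages per time bucket."""
    return Counter(bucket(ts, resolution) for ts in stamps)

per_second = volume(timestamps, "second")
per_minute = volume(timestamps, "minute")

# Counts are additive, so partial results from separate chunks can be merged.
merged = volume(timestamps[:2], "minute") + volume(timestamps[2:], "minute")
```

Because each message is counted independently, the chunks could be processed by different threads or workers and merged at the end, which is the property that makes this workload a natural fit for parallel processing.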

## Packages and Functions 
The following imports are commonly used Python packages and functions.

In [None]:
import os
import time
import pandas as pd
from datetime import datetime
from collections import OrderedDict
from matplotlib.dates import DateFormatter
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%matplotlib inline

## Data Size and Exploration  

In [None]:
%%bash
ls -la -h $DATA/raw/05/ | grep -o "jlroo .*." |  tail -n +3

In [None]:
%%bash
ls -la -h $HOME/project/XCME_20100503 | grep -o "jlroo .*."

In [None]:
%%bash
tail -n 5 $HOME/project/XCME_20100503

# OpenMP Approach on Stampede2

The application `fixanalyzer` was compiled with the Intel C++ compiler (`icpc`) with OpenMP pragmas enabled in order to process and analyze the data in parallel.

A bash command executes the application and runs some benchmarks. The output is piped to Python to generate some plots and metrics.

In [None]:
%%bash --out output
# Run the benchmark once per thread count; export so the child process sees the variable.
for t in 1 2 4 6 8 16 32 64; do
    export OMP_NUM_THREADS=$t
    echo $t
    $HOME/fixanalyzer -p $HOME/project/XCME_20100503 -t 52 -n 4 -m 8
    echo " "
done

## Data Processing

In [None]:
stdout = output.split("\n")[:-1]
stdout[:8]

In [None]:
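# ASSUMPTION: each run prints 8 lines, and lines 2-4 of a run are "YYYYMMDD,volume" pairs.
raw_dates = [stdout[k].split(",")[0] for k in range(2, 5)]
dates = [datetime.strptime(d, '%Y%m%d').date() for d in raw_dates]
# Daily volume is read from the first run only (k is the first index of each stride).
date_volume = [int(stdout[k].split(",")[1]) for k in range(2, 5)]
# NOTE: `search` is not defined in this notebook. It is expected to hold
# (threads, [total_msgs, read_time, search_time, volume_time]) pairs parsed from `stdout`.
metrics = [{"threads": i[0],
            "total_msgs": i[1][0],
            "read_time": i[1][1],
            "search_time": i[1][2],
            "volume_time": i[1][3]} for i in search]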

## Plots and Metrics

In [None]:
scaler = StandardScaler()
data = pd.DataFrame.from_dict(metrics)
data = data.apply(pd.to_numeric, errors='coerce')  # convert_objects() was removed in pandas 1.0

In [None]:
data

In [None]:
X = data['threads']
read_time = data['read_time']
search_time = data['search_time']
volume_time = data['volume_time']

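# one panel per timing metric, plus a combined panel with a log-scale y axis
plt.figure(figsize=(15,15))

# read time vs. threads, linear scale
plt.subplot(221)
plt.plot(X, read_time)
#plt.xscale('log')
plt.title('read time (linear)')
plt.grid(True)

# search time vs. threads, linear scale
plt.subplot(222)
plt.plot(X, search_time)
#plt.xscale('log')
plt.title('search time (linear)')
plt.grid(True)

# volume time vs. threads, linear scale
plt.subplot(223)
plt.plot(X, volume_time)
#plt.xscale('log')
plt.title('volume time (linear)')
plt.grid(True)

# all three metrics vs. threads, log-scale y axis
plt.subplot(224)
plt.plot(X, read_time, label='read_time')
plt.plot(X, search_time, label='search_time')
plt.plot(X, volume_time, label='volume_time')
plt.yscale('log')
plt.title('all metrics (log y)')
plt.legend()
plt.grid(True)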

plt.subplots_adjust(top=0.95, bottom=0.10, left=0.10, right=0.95, hspace=0.25, wspace=0.35)

plt.show()
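
Raw times per thread count are easier to interpret once converted to speedup and parallel efficiency. The sketch below uses hypothetical timings, not measured results from `fixanalyzer`, and the helper name `scaling` is illustrative. Speedup is the 1-thread time divided by the n-thread time, and efficiency is the speedup divided by n.

```python
# Hypothetical timings in seconds, keyed by thread count; not measured results.
timings = {1: 80.0, 2: 42.0, 4: 23.0, 8: 14.0}

def scaling(times):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread run."""
    base = times[1]
    return {n: (base / t, base / (t * n)) for n, t in sorted(times.items())}

result = scaling(timings)
for n, (speedup, efficiency) in result.items():
    print(f"{n:>2} threads: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

The same function could be applied to the `read_time`, `search_time`, and `volume_time` columns separately, since each phase may scale differently with the number of threads.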

## Daily Messages Volume

In [None]:
plt.figure(figsize=(12,12))
fig = plt.subplot()
fig.plot(dates, date_volume)
fig.figure.autofmt_xdate()
myFmt = DateFormatter("%Y-%m-%d")
fig.xaxis.set_major_formatter(myFmt)
plt.yscale('log')
plt.grid(True)
plt.show()

# Spark/Hadoop Approach on Wrangler

### Notebook Spark Setup 

The `findspark` package helps locate Spark on the system. If the environment variable `SPARK_HOME` is set, calling `findspark.init()` with no arguments is enough, and `findspark.find()` returns the location that was found. It is also possible to pass the path to Spark directly, as in `findspark.init('/path/to/spark/')`.

In [None]:
import findspark 
findspark.init('/usr/lib/spark', edit_profile=True)
findspark.find()

Once Spark has been found, it is possible to `import pyspark` and start the `SparkContext`.

In [None]:
import pyspark
sc = pyspark.SparkContext(appName="fixAnalyzer")

In [17]:
path = "/data/05191/jlroo/raw"
files = list(os.walk(path))[0][2]
hdfs = "hdfs://" + "/user/jlroo/cme/"

In [None]:
wkfiles = {int(i.split("_")[-2]):sc.textFile(hdfs + i) for i in files}
data = OrderedDict(sorted(wkfiles.items(), key=lambda t: t[0]))
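
The cell above keys each RDD by the integer found in the second-to-last underscore-separated field of the file name. The pure-Python sketch below uses hypothetical file names (the real names under `path` are not shown in this notebook) and lets the file name stand in for the RDD, to show how the key is extracted and how the entries end up ordered.

```python
from collections import OrderedDict

# Hypothetical file names; only the "_<YYYYMMDD>_" position matters here.
files = ["XCME_MD_ES_20100416_0423", "XCME_MD_ES_20100409_0416"]

# Same key extraction as the notebook cell, with the name standing in for the RDD.
wkfiles = {int(name.split("_")[-2]): name for name in files}
ordered = OrderedDict(sorted(wkfiles.items(), key=lambda t: t[0]))
keys = list(ordered)
```

Note that the keys are integers, so a lookup has to use an integer such as `ordered[20100409]` rather than the string `'20100409'`.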

In [None]:
count = {key:data[key].count() for key in data}

In [None]:
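from pyspark.sql import Row  # Row was used below without being imported
df = data[20100409].map(lambda r: Row(r)).toDF(["line"])  # keys of `data` are ints, not strings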