# High Frequency Trading & 2010 Market Flash Crash

**Jose Luis Rodriguez**

**COMP 488 Machine Learning** 

**Loyola University Chicago**

## Introduction 

As computational capabilities continue to grow and financial markets around the world become increasingly dependent on automated systems, it is worth analyzing the market depth data for the weeks leading up to, and the day of, May 6, 2010. The goal is to better understand market behavior during this period of high volatility and to compare this project's findings with the official CFTC and SEC report to Congress. The data used in this project comes from CME Group and consists of market depth messages, timestamped to the millisecond, for E-mini S&P 500 futures and options contracts.

## Methodology

The market depth data that CME Group provides contains all market data messages required to recreate the order book, that is, the list of orders used to record the interest of buyers and sellers in a particular financial instrument. Each message is between five and ten orders deep in futures markets and three orders deep in options markets. The data is timestamped to the millisecond, allowing for an in-depth analysis of price movement.
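To make the idea of recreating an order book from a stream of messages concrete, here is a minimal sketch in plain Python. It assumes a heavily simplified update of the form `(side, price, quantity)`, where a quantity of 0 removes a price level. The class name, field names, and prices are hypothetical and do not reflect the actual CME message layout.

```python
from typing import Dict, List, Tuple


class OrderBook:
    """Toy price-level order book; names are illustrative, not the CME schema."""

    def __init__(self) -> None:
        self.bids: Dict[float, int] = {}  # price -> resting quantity
        self.asks: Dict[float, int] = {}

    def apply(self, side: str, price: float, quantity: int) -> None:
        """Set the quantity at a price level; a quantity of 0 deletes the level."""
        levels = self.bids if side == "bid" else self.asks
        if quantity == 0:
            levels.pop(price, None)
        else:
            levels[price] = quantity

    def depth(self, n: int = 5) -> Tuple[List[Tuple[float, int]], List[Tuple[float, int]]]:
        """Return the best n bid levels (highest first) and ask levels (lowest first)."""
        bids = sorted(self.bids.items(), reverse=True)[:n]
        asks = sorted(self.asks.items())[:n]
        return bids, asks


# Made-up updates, for illustration only.
updates = [
    ("bid", 1165.25, 40),
    ("bid", 1165.00, 75),
    ("ask", 1165.50, 30),
    ("ask", 1165.75, 55),
    ("bid", 1165.25, 0),  # removes the 1165.25 bid level
]

book = OrderBook()
for side, price, quantity in updates:
    book.apply(side, price, quantity)

best_bids, best_asks = book.depth(n=5)
```

Replaying every update in timestamp order leaves the book in the state it had at the time of the last message, which is what makes a millisecond-level study of price movement possible.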

The volume of transactions is large (millions per week), and the goal is to compute volume and other metrics at daily, hourly, minute, second, and millisecond granularity. This requires running statistical operations such as distributions and averages, as well as filtering, in parallel. Map-reduce jobs are well suited to this type of task because the records are independent transactions, which allows for parallel processing in most cases.
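The map-reduce pattern described above can be sketched without a cluster. The example below is an illustration in plain Python, not the project's actual job: the record layout `(timestamp, quantity)`, the timestamp format, and the helper names `to_bucket`, `map_phase`, and `reduce_by_key` are all assumptions. In Spark, the same shape would typically be expressed with `rdd.map(...)` followed by `reduceByKey(...)`.

```python
from datetime import datetime
from operator import add

# Made-up (timestamp, traded quantity) records, for illustration only.
records = [
    ("2010-05-06 14:42:44.075", 10),
    ("2010-05-06 14:42:44.910", 5),
    ("2010-05-06 14:42:45.120", 7),
    ("2010-05-06 14:43:01.000", 3),
]


def to_bucket(timestamp: str, fmt: str) -> str:
    """Truncate a millisecond timestamp to the granularity given by fmt."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.%f")
    return parsed.strftime(fmt)


def map_phase(rows, fmt):
    """Map: emit one (bucket, quantity) pair per record, independently of the others."""
    return [(to_bucket(ts, fmt), qty) for ts, qty in rows]


def reduce_by_key(pairs, func):
    """Reduce: combine all values sharing a key with an associative function."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals


volume_per_second = reduce_by_key(map_phase(records, "%Y-%m-%d %H:%M:%S"), add)
volume_per_minute = reduce_by_key(map_phase(records, "%Y-%m-%d %H:%M"), add)
```

Because each record is mapped on its own and addition is associative, the pairs can be split across workers and the partial sums merged afterwards, which is why this kind of aggregation parallelizes well.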

## Packages and Functions

In [None]:
import os
import pandas as pd
from datetime import datetime
from collections import OrderedDict
from matplotlib.dates import DateFormatter
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%matplotlib inline

### Notebook Spark Setup 

The `findspark` package helps locate Spark on the system. If the environment variable `SPARK_HOME` is set, calling `findspark.init()` with no arguments is enough, and `findspark.find()` returns the path it detected. It is also possible to pass the path to the Spark installation directly with `findspark.init('/path/to/spark/')`.

In [None]:
import findspark 
findspark.init('/usr/lib/spark', edit_profile=True)
findspark.find()

Once Spark has been located, it is possible to `import pyspark` and start the `SparkContext`.

In [None]:
import pyspark
sc = pyspark.SparkContext(appName="fixAnalyzer")

In [17]:
path = "/data/05191/jlroo/raw"
files = list(os.walk(path))[0][2]
hdfs = "hdfs://" + "/user/jlroo/cme/"

In [None]:
wkfiles = {int(i.split("_")[-2]):sc.textFile(hdfs + i) for i in files}
data = OrderedDict(sorted(wkfiles.items(), key=lambda t: t[0]))

In [None]:
count = {key:data[key].count() for key in data}

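In [None]:
from pyspark.sql import Row, SparkSession
spark = SparkSession(sc)  # toDF() is only available on RDDs once a session exists
df = data[20100409].map(lambda r: Row(r)).toDF(["line"])  # keys of `data` are ints, not strings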