# Pairs Trading: A Statistical Arbitrage Strategy

# Outline
- Introduction
- Pairs trading: What is it?
- Problem statement
- Proposed solution
    - Gather data (API and web scraping)
    - Unsupervised clustering (K-means)
    - Evaluate pairs (time series analysis: cointegration)
    - Trading example
- Final thoughts

# Introduction
- Over 15 years of experience in the financial markets (New York, Toronto, Victoria)
    - Risk management and performance measurement for the buy-side (hedge funds, pension funds, endowments, etc.)
- Bachelor's degree from UBC in Vancouver
- MBA from Columbia University in New York City

<img src = '00_schools.png' width='150'>


- Why Lighthouse Labs?
    - After 15 years in finance, I wanted to apply new ideas to what I was doing: machine learning and data science
    - Since I still have a passion for finance, I wanted to learn how to apply machine learning and data science to financial modeling

# Pairs trading: Why?

- I want to demystify this strategy
- It has had a long and successful history on Wall Street, but it has long been in the domain of quants at major institutional firms (Morgan Stanley, DE Shaw, LTCM)

- My idea is to implement this strategy using information that is freely available to anyone and with tools that anyone can install on their machine (Python, Jupyter Lab, etc.)

# Pairs trading: What is it?

### Basic idea: 
- Select two stocks that move similarly > Find when they diverge > Sell overvalued stock, buy undervalued stock > Exit when they converge again and pocket a profit
- The idea is to capture absolute returns regardless of market fluctuations
<img src = '00.png'>

# Problem statement
- __The major problem is to identify appropriate pairs of stocks to trade__
- There are 1500 stocks that are liquidly traded on the North American stock markets
    - __1,124,250__ possible pairs to evaluate
- For a major index like the S&P 500 (505 tickers)
    - __127,260__ possible pairs to evaluate
- And for an even larger index like the Russell 3000
    - __4,498,500__ possible pairs to evaluate

- __We need a way to narrow down the number of pairs__
- __We also need to evaluate each of the remaining pairs__
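
The pair counts above are simply "n choose 2". A quick check, assuming a universe of 505 tickers for the S&P 500 (inferred from the 127,260 figure, not stated on the slide):

```python
from math import comb

# Universe sizes; the 505 for the S&P 500 is inferred from the pair count above.
universes = {
    "Liquid North American stocks": 1500,
    "S&P 500": 505,
    "Russell 3000": 3000,
}

# Number of unordered pairs = n * (n - 1) / 2
pair_counts = {name: comb(n, 2) for name, n in universes.items()}

for name, count in pair_counts.items():
    print(f"{name}: {count:,} possible pairs")
```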

# Proposed solution

- Use a machine learning methodology to cluster (group) similar stocks together
    - This clustering will be based on the companies' details (sector, financial ratios) and daily returns

- For each cluster, identify the pairs that are within the same cluster

- For each pair in the cluster, determine if it is appropriate for pairs trading

- Companies that are left unclustered are outliers and will __not__ be considered for pairs trading

- The clustering will be based on 2016-2018 data, so we will test a pairs trading strategy out of sample in 2019
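
A minimal sketch of the "pairs within the same cluster" step. The cluster labels here are made up for illustration, and using `-1` to mark an unclustered ticker is an assumption of this sketch, as is the ticker `XYZ`:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical cluster assignments; -1 marks an unclustered outlier (assumed convention).
labels = {"LMT": 0, "NOC": 0, "TXT": 0, "GS": 1, "MS": 1, "UNP": 1, "XYZ": -1}


def candidate_pairs(labels):
    """Return every unordered pair of tickers that share a cluster."""
    clusters = defaultdict(list)
    for ticker, label in labels.items():
        if label != -1:  # skip outliers
            clusters[label].append(ticker)

    pairs = []
    for members in clusters.values():
        pairs.extend(combinations(sorted(members), 2))
    return pairs


pairs = candidate_pairs(labels)
print(pairs)
```

Only pairs inside a cluster are generated, which is what shrinks the search space from every possible pair to a manageable list.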

# Gathering data
We are going to use the S&P 500 as our universe of stocks

- Scrape Wikipedia to get the tickers, sectors, and subsectors of every company in the S&P 500
<img src = '01.png'>

# Gathering data
- Scrape Yahoo! Finance to get each company's balance sheet, income statement, and cash flow statement
- We can then compute financial ratios that will indicate financial health and profitability
<img src = '02.png' width='750'>

# Gathering data
- Call the free Alpha Vantage API to get daily prices for each company
- From these prices, we can compute the daily returns

<img src = '03.png'>
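
Turning prices into daily returns is a one-liner in pandas. A small sketch with made-up prices and placeholder tickers (`AAA`, `BBB`), assuming one column per ticker and one row per trading day:

```python
import pandas as pd

# Made-up closing prices; the real data would come from the price API.
prices = pd.DataFrame(
    {"AAA": [100.0, 101.0, 99.99], "BBB": [50.0, 50.5, 51.005]},
    index=pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"]),
)

# Simple daily return: P_t / P_{t-1} - 1. The first row has no prior day, so drop it.
returns = prices.pct_change().dropna()
print(returns)
```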

# Gathering data

- We now have our data to feed into the machine learning model: 
    - Daily returns
    - Sector information
    - Financial ratios



- The idea is that companies with similar information will make good candidates for pairing

# Clustering: Preprocessing
Before we start clustering the companies, we need to preprocess the data a little

- Reduce the number of daily return dimensions using principal component analysis
    - Reduced 754 datapoints to 170 principal components for each company

- Append financial ratios to the dataset

- One-hot encode the sectors and append to the dataset

Finally, we can feed the combined dataset into a clustering method: K-means clustering
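
The preprocessing steps above can be sketched with scikit-learn. Everything here is synthetic and scaled down: the data, the ratio names, the 5 components, and `n_clusters=4` are placeholders, and the `StandardScaler` step is an assumption of this sketch rather than something stated in the slides:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_companies, n_days = 40, 60
tickers = [f"T{i:02d}" for i in range(n_companies)]

# Synthetic inputs: one row per company.
returns = rng.normal(0, 0.01, size=(n_companies, n_days))
ratios = pd.DataFrame(
    rng.normal(size=(n_companies, 3)),
    index=tickers,
    columns=["ratio_1", "ratio_2", "ratio_3"],
)
sectors = pd.Series(
    rng.choice(["Industrials", "Financials", "Energy"], n_companies),
    index=tickers,
    name="sector",
)

# 1. Reduce the daily-return dimensions with PCA.
components = PCA(n_components=5, random_state=0).fit_transform(returns)
pca_df = pd.DataFrame(components, index=tickers, columns=[f"pc_{i}" for i in range(5)])

# 2. One-hot encode the sectors.
sector_dummies = pd.get_dummies(sectors, dtype=float)

# 3. Combine everything and scale so no single feature dominates (assumed step).
features = pd.concat([pca_df, ratios, sector_dummies], axis=1)
X = StandardScaler().fit_transform(features)

# 4. Cluster with K-means.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = pd.Series(kmeans.fit_predict(X), index=tickers, name="cluster")
print(labels.value_counts())
```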

# Clustering: Results

K-means clustering grouped the companies into __53__ clusters

<img src = '04.png'>

# Clustering: Results


<img src = '05.png' width='400'>

# Clustering: Results
Inspect one of the clusters

<img src = '06.png'>

# Clustering: Results

Sample some random names

The tickers clustered with __NOC (Northrop Grumman)__ are: 
- __LMT__: Lockheed Martin Corp.
- __NOC__: Northrop Grumman
- __TXT__: Textron Inc.

The tickers clustered with __MS (Morgan Stanley)__ are: 
- __GS__: Goldman Sachs Group
- __MS__: Morgan Stanley
- __UNP__: Union Pacific Corp

# Clustering: Results

The tickers clustered with __BAC (Bank of America Corp)__ are: 
- __BAC__: Bank of America Corp
- __CSCO__: Cisco Systems
- __C__: Citigroup Inc.
- __JPM__: JPMorgan Chase & Co.
- __NTRS__: Northern Trust Corp.
- __OXY__: Occidental Petroleum
- __SLB__: Schlumberger Ltd.
- __WFC__: Wells Fargo

# Evaluating candidate pairs for trading suitability

Loop through all __856__ candidate pairs and determine which ones are suitable for pairs trading

The first inclination is to use the __correlation coefficient__ to measure the strength of the trading relationship...

...but correlation does not tell us if the spread between the two stocks is stationary or mean-reverting

Instead we are going to use a concept called __cointegration__.

__Cointegration__ measures whether the spread between two stocks through history is stationary, where spread is defined as:
`spread = ln(stock A) - n x ln(stock B)`

- 'n' is the hedge ratio, estimated by a linear regression on the historical prices of stock A and stock B (i.e., how much of stock B to hold against each unit of stock A)
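
A minimal sketch of the hedge ratio and spread calculation on synthetic prices. It assumes the regression is run on log prices, to match the log spread above; the function name and the simulated series are illustrative only:

```python
import numpy as np


def hedge_ratio_and_spread(prices_a, prices_b):
    """Regress ln(A) on ln(B); the slope is the hedge ratio n."""
    log_a, log_b = np.log(prices_a), np.log(prices_b)
    n, _intercept = np.polyfit(log_b, log_a, 1)  # returns slope, then intercept
    spread = log_a - n * log_b
    return n, spread


# Synthetic example: stock B follows a random walk, stock A tracks it with n = 0.8.
rng = np.random.default_rng(42)
log_b = np.log(50) + np.cumsum(rng.normal(0, 0.01, 500))
log_a = 0.5 + 0.8 * log_b + rng.normal(0, 0.005, 500)

n, spread = hedge_ratio_and_spread(np.exp(log_a), np.exp(log_b))
print(f"estimated hedge ratio: {n:.3f}")
```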

# Evaluating candidate pairs for trading suitability
- Use the Augmented Engle-Granger test for cointegration, which will return a p-value.
- I decided to use a stringent significance level for the test: Suitable pairs must have p-values less than __0.01__

- This resulted in __19__ identified pairs. There are __27__ unique tickers in those pairs.

[('AME', 'HON'),
 ('DOV', 'HON'),
 ('DOV', 'SNPS'),
 ('ETN', 'RTX'),
 ('EMR', 'HD'),
 ('EMR', 'RTX'),
 ('AWK', 'WEC'),
 ('GL', 'L'),
 ('L', 'RJF'),
 ('MMC', 'SRE'),
 ('MMC', 'WLTW'),
 ('ORCL', 'SRE'),
 ('ORCL', 'WLTW'),
 ('SRE', 'WLTW'),
 ('CMA', 'RF'),
 ('MXIM', 'TXN'),
 ('PNR', 'SWK'),
 ('NWSA', 'NWS'),
 ('IP', 'WRK')]

# Evaluating candidate pairs for trading suitability
# Recap of where we are
<img src = '07.png' width='400'>

# Trading example

__PNR__ and __SWK__ 
-  __Pentair plc__: Industrials (Industrial Machinery)
-  __Stanley Black & Decker__: Industrials (Industrial Machinery)

<img src = '09.png'>
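
One common way to turn the spread into trades is a z-score rule. This is a sketch under assumptions, not necessarily the rule used in the chart above: the entry and exit thresholds (2.0 and 0.5) are placeholders, and the z-scores are assumed to be computed with the mean and standard deviation of the 2016-2018 spread to avoid look-ahead:

```python
import numpy as np


def positions_from_zscores(z, entry_z=2.0, exit_z=0.5):
    """+1 = long A / short B, -1 = short A / long B, 0 = flat."""
    positions = np.zeros(len(z), dtype=int)
    current = 0
    for i, value in enumerate(z):
        if current == 0:
            if value > entry_z:        # spread unusually wide: A looks overvalued
                current = -1
            elif value < -entry_z:     # spread unusually narrow: A looks undervalued
                current = 1
        elif abs(value) < exit_z:      # spread has converged: close the trade
            current = 0
        positions[i] = current
    return positions


# Hypothetical z-scores of the spread over nine trading days.
z = np.array([0.0, 0.5, 2.5, 1.2, 0.3, -0.1, -2.2, -1.0, 0.0])
positions = positions_from_zscores(z)
print(positions.tolist())
```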

# Final thoughts
- __Challenges__:
    - Need to space out my calls to Yahoo! Finance and the Alpha Vantage API
        - 24 hours to get pricing information
        - 5 hours to get financial statements
    - Need to find an appropriate 'k' for K-means clustering
    
    
- __Future considerations__:
    - Use a larger universe of stocks and larger set of indicators (more financial ratios)
    - Perhaps shorten the window of data used for clustering
    - Experiment with other clustering techniques (e.g., DBSCAN, hierarchical)
    - Loosen the significance level of the Augmented Engle-Granger test for cointegration to capture more pairs
    - Do a portfolio analysis on many pairs trades over many business cycles to see if absolute returns are achieved over different market fluctuations

# Thank you