# CS 6780: Advanced Machine Learning

### <i>Enhancing Pairs Trading: The Power of Unsupervised Learning Approaches</i>

## Methodology

Our features contain at a high level two key pieces of information: (i) the returns of the stock on a monthly basis over the last 72 months and (ii) the firm characteristics of the stock on a quarterly basis over the last 72 months. In so doing, we capture both quantitative and qualitative aspects of the stock. 

We obtain our data from the Center for Research in Security Prices (CRSP), focusing on stocks with common shares listed on the New York Stock Exchange (NYSE), American Stock Exchange (AMEX), and Nasdaq. Our sampling period is 1/2010 to 1/2016.

In [214]:
import pandas as pd

### 1. Returns of the Stock Dataset: Filtering, Cleaning, and Generating

We must clean and filter our data. We do not consider stocks that are delisted as we cannot enter trades with them nor do we consider stocks that are missing more than 25% of their data. We also disregard stocks that have low trading volumes as they are illiquid and hence are potentially more volatile. 

In [215]:
# Import and peak at stock dataset
df = pd.read_csv('/Users/kevinwon/Desktop/quant/data.csv')
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df.head()


Unnamed: 0,PERMNO,date,SHRCD,EXCHCD,TICKER,DLSTCD,PRC,VOL,RET,SHROUT
0,10001,2010-01-29,11,2,EGAS,,10.06,3104.0,-0.018932,4361.0
1,10001,2010-02-26,11,2,EGAS,,10.0084,1510.0,-0.000656,4361.0
2,10001,2010-03-31,11,2,EGAS,,10.17,2283.0,0.020643,4361.0
3,10001,2010-04-30,11,2,EGAS,,11.39,3350.0,0.124385,6070.0
4,10001,2010-05-28,11,2,EGAS,,11.4,3451.0,0.004829,6071.0


In [216]:
print("Number of rows in dataset: ", len(df))
print("Number of stocks in dataset: ", len(set(df['PERMNO'])))

Number of rows in dataset:  278179
Number of stocks in dataset:  5341


#### Remove Delisted Stocks

Below are the delisting codes:

1) still trading or halted but not yet delisted
2) merger
3) exchange
4) liquidation
5) delisted by NYSE, AMEX, or Nasdaq
7) delisted by the Securities and Exchange Commission
8) trading simultaneously on more than one exchange

We remove stocks with a delisting code of 

2) because stocks undergoing mergers experience a change in their trading dynamics
3) because stocks placed in a different exchange might affect its volatility/liquidity
4) because liquidation often leads to delisting and can greatly affect the stock's price
5) because the stock is delisted
7) because the stock is delisted

In [217]:
# Convert 'DLSTCD' to a string and pad with zeros to ensure 3 digits
df['DLSTCD_str'] = df['DLSTCD'].astype(str).str.pad(3, fillchar='0')

# Filter the DataFrame based on the condition that the hundredth digit is 2, 3, 4, 5, or 7
delisted_stocks = df[df['DLSTCD_str'].str[0].isin(['2', '3', '4', '5', '7'])]

# Extract the PERMNO identifiers for these filtered rows
permno_list_to_remove_delist = list(set(delisted_stocks['PERMNO']))

#### Remove Illiquid Stocks

To do this, we add a "turnover" column to our dataset, defined as trading volume divided by shares outstanding, and set a threshold for liquidity to exclude stocks.

To determine a good liquidity threshold for excluding stocks with low turnover, we conduct various statistical analyses on the turnover data. We calculate percentiles, mean, median, and standard deviation to help set a threshold.

In [218]:
# Add a "turnover" column to the dataset
df['Turnover'] = df['VOL']/df['SHROUT']
df.head()

# Basic statistics
mean_turnover = df['Turnover'].mean()
median_turnover = df['Turnover'].median()
std_dev_turnover = df['Turnover'].std()

print(f"Mean Turnover: {mean_turnover}")
print(f"Median Turnover: {median_turnover}")
print(f"Standard Deviation of Turnover: {std_dev_turnover}")

# Percentile analysis
percentiles = [10, 25, 50, 75, 90]
percentile_values = df['Turnover'].quantile([p / 100 for p in percentiles]).to_dict()

print("\nTurnover Percentiles:")
for percentile, value in percentile_values.items():
    print(f"{percentile * 100}th percentile: {value}")

# Exclude stocks below the 25th percentile
threshold = percentile_values[0.25]
print(f"\nSuggested Threshold (25th percentile): {threshold}")

illiquid_stocks = df[df['Turnover'] <= threshold] 
permno_list_to_remove_illiquid = list(set(illiquid_stocks['PERMNO']))

Mean Turnover: 1.753014015862979
Median Turnover: 1.128333464703306
Standard Deviation of Turnover: 3.622119669395907

Turnover Percentiles:
10.0th percentile: 0.17888617676375773
25.0th percentile: 0.4934275044261027
50.0th percentile: 1.128333464703306
75.0th percentile: 2.093354380353686
90.0th percentile: 3.615274512236201

Suggested Threshold (25th percentile): 0.4934275044261027


However, we also want to ensure that only stocks with a significant history of low turnover are considered, reducing the chance of excluding stocks due to short-term anomalies or data issues. Let's also remove stocks with prices less than or equal to 0.

In [219]:
for item in permno_list_to_remove_illiquid:
    temp = illiquid_stocks.loc[illiquid_stocks['PERMNO'] == item]
    if len(temp) < 3:                    
        permno_list_to_remove_illiquid.remove(item)

We now remove all stocks from the dataset based on the above conditions.

In [220]:
final_list = list(set(permno_list_to_remove_delist) | set(permno_list_to_remove_negative) | set(permno_list_to_remove_illiquid))
filtered_df = df.loc[df['PERMNO'].isin(final_list) == False]

print("Dataset size: ", len(filtered_df))
print("Number of stocks: ", len(set(filtered_df['PERMNO'])))
filtered_df.head()

Dataset size:  119737
Number of stocks:  1995


Unnamed: 0,PERMNO,date,SHRCD,EXCHCD,TICKER,DLSTCD,PRC,VOL,RET,SHROUT,DLSTCD_str,Turnover
323,10032,2010-01-29,11,3,PLXS,,34.01,91386.0,0.194171,39774.0,,2.297632
324,10032,2010-02-26,11,3,PLXS,,34.49,66482.0,0.014114,39774.0,,1.671494
325,10032,2010-03-31,11,3,PLXS,,36.03,53286.0,0.044651,39774.0,,1.339719
326,10032,2010-04-30,11,3,PLXS,,37.08,56600.0,0.029142,40359.0,,1.402413
327,10032,2010-05-28,11,3,PLXS,,34.05,61408.0,-0.081715,40359.0,,1.521544


Let's now create the actual feature vectors! Here is how we do this:

We transform our dataset into a pivot table, organizing stock return data ('RET') with stocks ('PERMNO') as rows and dates as columns. We then clean the data by removing stocks with over 25% missing values, ensuring a robust dataset. Remaining missing values are filled using a forward-fill method limited to 5 consecutive fills, balancing data integrity with practical imputation. 

In [221]:
# Convert 'RET' to numeric, coercing errors to NaN
filtered_df['RET'] = pd.to_numeric(filtered_df['RET'], errors='coerce')

# Get the pivot table, with stock permno as index and datetime as columns
df_price = pd.pivot_table(filtered_df, values = 'RET', index = 'PERMNO', columns = 'date')

# Delete stocks with more than half nan price for clustering
df_price = df_price.dropna(thresh = 3*len(df_price.columns)//4)
df_price = df_price.fillna(method = 'ffill', axis = 1, limit = 5)

df_price.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df['RET'] = pd.to_numeric(filtered_df['RET'], errors='coerce')
  df_price = df_price.fillna(method = 'ffill', axis = 1, limit = 5)


date,2010-01-29,2010-02-26,2010-03-31,2010-04-30,2010-05-28,2010-06-30,2010-07-30,2010-08-31,2010-09-30,2010-10-29,...,2015-04-30,2015-05-29,2015-06-30,2015-07-31,2015-08-31,2015-09-30,2015-10-30,2015-11-30,2015-12-31,2016-01-29
PERMNO,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10032,0.194171,0.014114,0.044651,0.029142,-0.081715,-0.214684,0.091997,-0.211901,0.275394,0.034072,...,0.055923,0.056446,-0.03518,-0.130811,-0.001835,0.013396,-0.102644,0.074523,-0.06129,0.000859
10051,0.175705,0.146986,-0.025201,0.025302,-0.0853,0.053372,-0.0451,-0.239067,0.114176,0.287483,...,-0.015425,0.029096,0.019574,-0.076792,-0.171442,-0.239264,0.057185,0.07975,0.056519,-0.179939
10107,-0.075459,0.022001,0.021538,0.042595,-0.150811,-0.10814,0.121686,-0.085819,0.043682,0.088812,...,0.196409,-0.030222,-0.057832,0.057758,-0.061456,0.017004,0.189336,0.039324,0.020791,-0.00703
10138,-0.068169,0.021564,0.089761,0.046753,-0.139381,-0.098142,0.086506,-0.092059,0.149463,0.103965,...,0.027167,-0.006036,-0.030239,-0.007719,-0.068067,-0.025876,0.088058,0.007009,-0.054366,-0.007554
10145,-0.014286,0.047166,0.127241,0.048597,-0.092637,-0.087445,0.09813,-0.081603,0.124936,0.072144,...,-0.032499,0.037629,-0.021401,0.030205,-0.050095,-0.046137,0.090717,0.012248,-0.003656,-0.003572


In [222]:
print("Number of training stocks: ", len(df_price))

Number of training stocks:  1499
