# CS 6780: Advanced Machine Learning

### <i>Enhancing Pairs Trading: The Power of Unsupervised Learning Approaches</i>

## Feature Generation

In this notebook, we create our feature vectors for our unsupervised learning clustering methods. Our features contain at a high level two key pieces of information: (i) the returns of the stock on a monthly basis over the last 72 months and (ii) the firm characteristics of the stock on a quarterly basis over the last 72 months. In so doing, we capture both quantitative and qualitative aspects of the stock. 

### 1. Returns of the Stock Dataset: Filtering, Cleaning, and Generating

In this section, we load, clean, and filter our data for part (i) of our feature vector. We obtain our data from the Center for Research in Security Prices (CRSP), focusing on stocks with common shares listed on the New York Stock Exchange (NYSE), American Stock Exchange (AMEX), and Nasdaq. We omit stocks that have been delisted from these exchanges, as we cannot trade them, as well as stocks missing 25% of their data to ensure the robustness of our analysis. We also exclude stocks with low trading volumes due to their inherent illiquidity and potential for heightened volatility. Our sampling period is 1/2010 to 1/2016.

In [241]:
import pandas as pd

df = pd.read_csv('/Users/kevinwon/Desktop/quant/data.csv')
df.head()

Unnamed: 0,PERMNO,date,SHRCD,EXCHCD,TICKER,DLSTCD,PRC,VOL,RET,SHROUT
0,10001,01/29/2010,11,2,EGAS,,10.06,3104.0,-0.018932,4361.0
1,10001,02/26/2010,11,2,EGAS,,10.0084,1510.0,-0.000656,4361.0
2,10001,03/31/2010,11,2,EGAS,,10.17,2283.0,0.020643,4361.0
3,10001,04/30/2010,11,2,EGAS,,11.39,3350.0,0.124385,6070.0
4,10001,05/28/2010,11,2,EGAS,,11.4,3451.0,0.004829,6071.0


In [244]:
print("Number of rows in dataset: ", len(df))
print("Number of stocks in dataset: ", len(set(df['PERMNO'])))

Number of rows in dataset:  278179
Number of stocks in dataset:  5341


#### Exclusion of Delisted Stocks

Below are the delisting codes for CRSP:

- (1) Still trading or halted but not yet delisted
- (2) Merger
- (3) Exchange
- (4) Liquidation
- (5) Delisted by NYSE, AMEX, or Nasdaq
- (7) Delisted by the Securities and Exchange Commission (SEC)
- (8) Trading simultaneously on more than one exchange

We remove stocks with a delisting code of 

- (2) Because stocks in the midst of a merger may exhibit altered trading dynamics
- (3) Because a stock's transfer to a different exchange could potentially impact its volatility and liquidity
- (4) Because the process of liquidation frequently precedes delisting and can significantly influence the stock's market value
- (5) Because delisting from prominent exchanges such as the NYSE, AMEX, or Nasdaq signifies significant underlying financial or operational issues
- (7) Because delisting by the SEC usually occurs due to severe violations of regulatory standards or failure to adhere to financial reporting requirements

In [235]:
# Convert 'DLSTCD' to a string and pad with zeros to ensure 3 digits
df['DLSTCD_str'] = df['DLSTCD'].astype(str).str.pad(3, fillchar='0')

# Filter the DataFrame based on the condition that the hundredth digit is 2, 3, 4, 5, or 7
delisted_stocks = df[df['DLSTCD_str'].str[0].isin(['2', '3', '4', '5', '7'])]

# Extract the PERMNO identifiers for these filtered rows
permno_list_to_remove_delist = list(set(delisted_stocks['PERMNO']))

#### Exclusion of Illiquid Stocks

We add a "turnover" column to our dataset, defined as trading volume divided by shares outstanding, which serves as a key indicator of liquidity. To establish a robust liquidity threshold for the exclusion of low-turnover stocks, we conduct various statistical tests on the turnover data. We calculate percentiles, mean, median, and standard deviation to determine an appropriate threshold. We decide to exclude stocks that fall below the 25th percentile, as this threshold strikes an optimal balance, offering the potential for higher returns due to the wider bid-ask spreads characteristic of less liquid stocks, while also acknowledging the associated increased risks.

In [245]:
# Add a "turnover" column to the dataset
df['Turnover'] = df['VOL']/df['SHROUT']
df.head()

Unnamed: 0,PERMNO,date,SHRCD,EXCHCD,TICKER,DLSTCD,PRC,VOL,RET,SHROUT,Turnover
0,10001,01/29/2010,11,2,EGAS,,10.06,3104.0,-0.018932,4361.0,0.711763
1,10001,02/26/2010,11,2,EGAS,,10.0084,1510.0,-0.000656,4361.0,0.346251
2,10001,03/31/2010,11,2,EGAS,,10.17,2283.0,0.020643,4361.0,0.523504
3,10001,04/30/2010,11,2,EGAS,,11.39,3350.0,0.124385,6070.0,0.551895
4,10001,05/28/2010,11,2,EGAS,,11.4,3451.0,0.004829,6071.0,0.56844


In [246]:
# Basic statistics
mean_turnover = df['Turnover'].mean()
median_turnover = df['Turnover'].median()
std_dev_turnover = df['Turnover'].std()

print(f"Mean Turnover: {mean_turnover}")
print(f"Median Turnover: {median_turnover}")
print(f"Standard Deviation of Turnover: {std_dev_turnover}")

# Percentile analysis
percentiles = [10, 25, 50, 75, 90]
percentile_values = df['Turnover'].quantile([p / 100 for p in percentiles]).to_dict()

print("\nTurnover Percentiles:")
for percentile, value in percentile_values.items():
    print(f"{percentile * 100}th percentile: {value}")

# Exclude stocks below the 25th percentile
threshold = percentile_values[0.25]
print(f"\nSuggested Threshold (25th percentile): {threshold}")

illiquid_stocks = df[df['Turnover'] <= threshold] 
permno_list_to_remove_illiquid = list(set(illiquid_stocks['PERMNO']))

Mean Turnover: 1.753014015862979
Median Turnover: 1.128333464703306
Standard Deviation of Turnover: 3.622119669395907

Turnover Percentiles:
10.0th percentile: 0.17888617676375773
25.0th percentile: 0.4934275044261027
50.0th percentile: 1.128333464703306
75.0th percentile: 2.093354380353686
90.0th percentile: 3.615274512236201

Suggested Threshold (25th percentile): 0.4934275044261027


However, we also want to ensure that only stocks with a significant history of low turnover are considered, reducing the chance of excluding stocks due to short-term anomalies or data issues. Let's also remove stocks with prices less than or equal to 0.

In [237]:
for item in permno_list_to_remove_illiquid:
    temp = illiquid_stocks.loc[illiquid_stocks['PERMNO'] == item]
    if len(temp) < 3:                    
        permno_list_to_remove_illiquid.remove(item)

We now remove all stocks from the dataset based on the above conditions.

In [238]:
final_list = list(set(permno_list_to_remove_delist) | set(permno_list_to_remove_negative) | set(permno_list_to_remove_illiquid))
filtered_df = df.loc[df['PERMNO'].isin(final_list) == False]

print("Dataset size: ", len(filtered_df))
print("Number of stocks: ", len(set(filtered_df['PERMNO'])))
filtered_df.head()

Dataset size:  119737
Number of stocks:  1995


Unnamed: 0,PERMNO,date,SHRCD,EXCHCD,TICKER,DLSTCD,PRC,VOL,RET,SHROUT,DLSTCD_str,Turnover
323,10032,01/29/2010,11,3,PLXS,,34.01,91386.0,0.194171,39774.0,,2.297632
324,10032,02/26/2010,11,3,PLXS,,34.49,66482.0,0.014114,39774.0,,1.671494
325,10032,03/31/2010,11,3,PLXS,,36.03,53286.0,0.044651,39774.0,,1.339719
326,10032,04/30/2010,11,3,PLXS,,37.08,56600.0,0.029142,40359.0,,1.402413
327,10032,05/28/2010,11,3,PLXS,,34.05,61408.0,-0.081715,40359.0,,1.521544


Let's now create the actual feature vectors! Here is how we do this:

We transform our dataset into a pivot table, organizing stock return data ('RET') with stocks ('PERMNO') as rows and dates as columns. We then clean the data by removing stocks with over 25% missing values, ensuring a robust dataset. Remaining missing values are filled using a forward-fill method limited to 5 consecutive fills, balancing data integrity with practical imputation. 

In [239]:
# Convert 'RET' to numeric, coercing errors to NaN
filtered_df['RET'] = pd.to_numeric(filtered_df['RET'], errors='coerce')

# Get the pivot table, with stock permno as index and datetime as columns
df_price = pd.pivot_table(filtered_df, values = 'RET', index = 'PERMNO', columns = 'date')

# Delete stocks with more than half nan price for clustering
df_price = df_price.dropna(thresh = 3*len(df_price.columns)//4)
df_price = df_price.fillna(method = 'ffill', axis = 1, limit = 5)

df_price.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df['RET'] = pd.to_numeric(filtered_df['RET'], errors='coerce')
  df_price = df_price.fillna(method = 'ffill', axis = 1, limit = 5)


date,01/29/2010,01/29/2016,01/30/2015,01/31/2011,01/31/2012,01/31/2013,01/31/2014,02/26/2010,02/27/2015,02/28/2011,...,11/30/2010,11/30/2011,11/30/2012,11/30/2015,12/30/2011,12/31/2010,12/31/2012,12/31/2013,12/31/2014,12/31/2015
PERMNO,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10032,0.194171,0.000859,-0.080563,-0.12605,0.322863,-0.010853,-0.096789,0.014114,0.062286,0.162352,...,-0.105931,0.05642,-0.139725,0.074523,0.008471,0.140225,0.114471,0.072331,0.056396,-0.06129
10051,0.175705,-0.179939,-0.014612,-0.030203,0.048154,0.050073,-0.140569,0.146986,0.199722,0.309002,...,0.024573,-0.078872,0.02998,0.07975,0.168125,0.104797,0.047874,0.012873,0.020979,0.056519
10107,-0.075459,-0.00703,-0.130248,-0.006628,0.137519,0.027717,0.011494,0.022001,0.093069,-0.035528,...,-0.046784,-0.031919,-0.05939,0.039324,0.014855,0.105018,0.003558,-0.018883,-0.028446,0.020791
10138,-0.068169,-0.007554,-0.083159,0.021382,0.015628,0.09664,-0.063627,0.021564,0.049289,0.01608,...,0.055365,0.074186,-0.003237,0.007009,0.008809,0.111092,0.027634,0.045861,0.033904,-0.054366
10145,-0.014286,-0.003572,-0.021617,0.053612,0.067893,0.075154,-0.001532,0.047166,0.056644,0.039859,...,0.061611,0.040506,0.008165,0.012248,0.003693,0.069403,0.034893,0.032313,0.00858,-0.003656


In [240]:
print("Number of training stocks: ", len(df_price))

Number of training stocks:  1499
