# Unsupervised Learning Trading Strategy

* Download/load S&P 500 stock price data.
* Calculate different features and indicators for each stock.
* Aggregate to monthly level and filter the top 150 most liquid stocks.
* Calculate monthly returns for different time horizons.
* Download Fama-French factors and calculate rolling factor betas.
* For each month, fit a K-Means clustering algorithm to group similar assets based on their features.
* For each month, select assets based on the cluster and form a portfolio using Efficient Frontier max Sharpe ratio optimization.
* Visualize portfolio returns and compare them to S&P 500 returns.

In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import yfinance as yf
import pandas_ta
from sklearn.cluster import KMeans
from pypfopt import risk_models, expected_returns
from pypfopt.efficient_frontier import EfficientFrontier
import warnings
warnings.filterwarnings('ignore')

**Pre-Process the data**

In [17]:
sp500 = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]
sp500['Symbol'] = sp500['Symbol'].str.replace('.', '-', regex=False) # for two tickers: BRK.B, BF.B
symbols_list = sp500['Symbol'].unique().tolist()
end_date = dt.datetime.now()
start_date = end_date - pd.DateOffset(days=365*7)  # roughly seven years of history
df = yf.download(tickers=symbols_list, end=end_date, start=start_date).stack()
df.index.names = ['date', 'ticker']
df.columns = df.columns.str.lower()
df

[*********************100%***********************]  503 of 503 completed


date,ticker,close,high,low,open,volume
2018-03-27,A,63.610573,65.158111,63.173844,65.015705,2104500.0
2018-03-27,AAPL,39.719955,41.326780,39.384906,40.979932,163690400.0
2018-03-27,ABBV,67.489601,70.210896,66.836784,70.049525,10821300.0
2018-03-27,ABT,53.342697,54.528484,53.086074,54.218765,6365300.0
2018-03-27,ACGL,26.774134,27.151322,26.561765,26.713908,2462400.0
...,...,...,...,...,...,...
2025-03-24,XYL,120.800003,121.099998,119.279999,119.970001,2251800.0
2025-03-24,YUM,155.820007,157.720001,155.029999,157.070007,1952500.0
2025-03-24,ZBH,111.239998,112.279999,110.690002,110.930000,1034000.0
2025-03-24,ZBRA,297.410004,301.769989,293.859985,294.329987,512000.0


**Calculate Technical Indicators**

In [18]:
df['rsi'] = df.groupby(level=1)['close'].transform(lambda x: pandas_ta.rsi(close=x, length=20))
df['dollar_volume'] = (df['close']*df['volume']) / 1e6
df

date,ticker,close,high,low,open,volume,rsi,dollar_volume
2018-03-27,A,63.610573,65.158111,63.173844,65.015705,2104500.0,,133.868450
2018-03-27,AAPL,39.719955,41.326780,39.384906,40.979932,163690400.0,,6501.775395
2018-03-27,ABBV,67.489601,70.210896,66.836784,70.049525,10821300.0,,730.325221
2018-03-27,ABT,53.342697,54.528484,53.086074,54.218765,6365300.0,,339.542270
2018-03-27,ACGL,26.774134,27.151322,26.561765,26.713908,2462400.0,,65.928627
...,...,...,...,...,...,...,...,...
2025-03-24,XYL,120.800003,121.099998,119.279999,119.970001,2251800.0,43.098372,272.017447
2025-03-24,YUM,155.820007,157.720001,155.029999,157.070007,1952500.0,59.976874,304.238564
2025-03-24,ZBH,111.239998,112.279999,110.690002,110.930000,1034000.0,56.957240,115.022158
2025-03-24,ZBRA,297.410004,301.769989,293.859985,294.329987,512000.0,40.455044,152.273922
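
To make the RSI feature less of a black box, here is a minimal sketch of a Wilder-style RSI in plain pandas, applied per ticker with the same `groupby(...).transform(...)` pattern as the cell above. The function name `rsi_sketch` and the toy tickers `UP` and `DOWN` are illustrative assumptions, and the values may differ slightly from what `pandas_ta.rsi` returns.

```python
import numpy as np
import pandas as pd

def rsi_sketch(close: pd.Series, length: int = 20) -> pd.Series:
    """Wilder-style RSI: 100 - 100 / (1 + avg_gain / avg_loss)."""
    delta = close.diff()
    gain = delta.clip(lower=0)       # size of up moves, zero on down days
    loss = (-delta).clip(lower=0)    # size of down moves, zero on up days
    # Wilder's smoothing is an exponential average with alpha = 1 / length
    avg_gain = gain.ewm(alpha=1 / length, min_periods=length, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1 / length, min_periods=length, adjust=False).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# Toy data in the notebook's long (date, ticker) layout
dates = pd.date_range('2024-01-01', periods=40, freq='D')
wide = pd.DataFrame({'UP': np.arange(100.0, 140.0),          # rises every day
                     'DOWN': np.arange(140.0, 100.0, -1.0)}, # falls every day
                    index=dates)
wide.index.name = 'date'
wide.columns.name = 'ticker'
toy = wide.stack().to_frame('close')

# One RSI series per ticker, so no ticker's prices leak into another's indicator
toy['rsi'] = toy.groupby(level='ticker')['close'].transform(rsi_sketch)
print(toy.groupby(level='ticker')['rsi'].last())
```

A series that only rises ends at the upper extreme of the RSI scale and one that only falls ends at the lower extreme, which is a quick sanity check on the formula.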


* Aggregate to monthly level and filter the top 150 most liquid stocks for each month.
* Calculate the 5-year rolling average of dollar volume for each stock before filtering.

In [19]:
# Monthly mean of dollar volume and month end value of RSI
df1 = df.unstack('ticker')['dollar_volume'].resample('M').mean().stack('ticker').to_frame('dollar_volume')
df2 = df.unstack()['rsi'].resample('M').last().stack('ticker').to_frame('rsi')
data = pd.concat([df1, df2], axis=1).dropna()
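# Replace the monthly figure with its 5-year rolling mean (at least 12 months of history)
data['dollar_volume'] = data.iloc[:,0].unstack('ticker').rolling(5*12, min_periods=12).mean().stack()
# Rank within each month (1 = most liquid); the strict `<150` keeps ranks 1 through 149
data['dollar_volume_rank'] = data.groupby('date')['dollar_volume'].rank(ascending=False)
data = data[data['dollar_volume_rank']<150].drop(['dollar_volume', 'dollar_volume_rank'], axis=1)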
data

date,ticker,rsi
2019-03-31,AAPL,63.873748
2019-03-31,ABBV,50.140716
2019-03-31,ABT,61.581498
2019-03-31,ACN,74.400706
2019-03-31,ADBE,58.172481
...,...,...
2025-03-31,VZ,54.486121
2025-03-31,WDAY,46.991004
2025-03-31,WFC,51.250123
2025-03-31,WMT,41.668749
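
The smoothing-then-ranking step can be easier to follow on a tiny example. The sketch below uses made-up tickers and dollar volumes, a hypothetical helper `top_n_liquid`, and a 3-month window instead of 60 months; only the pattern (rolling mean, then cross-sectional rank per month) mirrors the cell above.

```python
import pandas as pd

# Toy monthly dollar volume, one column per ticker (values are invented)
months = pd.to_datetime(['2023-01-31', '2023-02-28', '2023-03-31', '2023-04-30'])
monthly_dv = pd.DataFrame({'AAA': [10.0, 12.0, 14.0, 16.0],
                           'BBB': [50.0, 40.0, 35.0, 30.0],
                           'CCC': [5.0, 5.0, 5.0, 5.0],
                           'DDD': [30.0, 30.0, 30.0, 30.0]}, index=months)
monthly_dv.index.name = 'date'
monthly_dv.columns.name = 'ticker'

def top_n_liquid(monthly_dv, n, window, min_periods):
    # Smooth each ticker's liquidity over time
    smoothed = monthly_dv.rolling(window, min_periods=min_periods).mean()
    long = smoothed.stack().to_frame('dollar_volume')
    # Rank across tickers within each month; rank 1 is the most liquid
    long['rank'] = long.groupby(level='date')['dollar_volume'].rank(ascending=False)
    # Months without enough history have NaN ranks and drop out here
    return long[long['rank'] <= n]

top2 = top_n_liquid(monthly_dv, n=2, window=3, min_periods=2)
print(top2)
```

With `rank <= n` and no ties, each month keeps exactly `n` names; a strict `<` comparison keeps ranks 1 through `n - 1`.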


* For each month fit a K-Means Clustering Algorithm to group similar assets based on their features. 
* We will pre-define our centroids for each cluster.

In [20]:
initial_centroids = [[30], [45], [55], [70]] # RSI values

def get_clusters(df):
    df['cluster'] = KMeans(n_clusters=4,
                           random_state=0,
                           init=initial_centroids).fit(df).labels_
    return df

data = data.dropna().groupby('date', group_keys=False).apply(get_clusters)
data

date,ticker,rsi,cluster
2019-03-31,AAPL,63.873748,3
2019-03-31,ABBV,50.140716,1
2019-03-31,ABT,61.581498,2
2019-03-31,ACN,74.400706,3
2019-03-31,ADBE,58.172481,2
...,...,...,...
2025-03-31,VZ,54.486121,2
2025-03-31,WDAY,46.991004,1
2025-03-31,WFC,51.250123,2
2025-03-31,WMT,41.668749,1
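
To see why fixed starting centroids make the cluster labels interpretable, the following is a plain NumPy sketch of Lloyd's iterations on one-dimensional RSI values. It is not scikit-learn's implementation; the function `kmeans_1d` and the sample RSI values are assumptions made for illustration.

```python
import numpy as np

def kmeans_1d(values, init_centroids, n_iter=100):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    x = np.asarray(values, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Assignment step: each point takes the label of its nearest centroid
        labels = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid)
        new = np.array([x[labels == k].mean() if np.any(labels == k) else centroids[k]
                        for k in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

rsi = [28.0, 33.0, 44.0, 47.0, 54.0, 57.0, 68.0, 74.0]
labels, centroids = kmeans_1d(rsi, init_centroids=[30, 45, 55, 70])
print(labels)     # cluster index per stock
print(centroids)  # final centroid per cluster
```

Because the seeds are ordered from low to high RSI, label 3 starts out as the high-RSI group. The centroids still move during fitting, so the final groups are centered near, not exactly at, the seed values.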


* For each month, select assets based on the cluster and form a portfolio using Efficient Frontier max Sharpe ratio optimization.

* First we will keep only the stocks in the cluster we choose based on our hypothesis.

* Momentum is persistent, so the idea is that stocks clustered around the RSI 70 centroid should continue to outperform in the following month. We therefore select the stocks in cluster 3.

In [21]:
data = data[data['cluster']==3].drop(['cluster'], axis=1)
data = data.reset_index(level=1)
data.index = data.index + pd.DateOffset(days=1)  # month-end signal date -> first day of the next month
data = data.reset_index().set_index(['date', 'ticker'])
dates = data.index.get_level_values('date').unique().tolist()
portfolio = {}
for date in dates:
    portfolio[date.strftime('%Y-%m-%d')] = data.xs(date, level=0).index.tolist()
portfolio

{'2019-04-01': ['AAPL',
  'ACN',
  'ADP',
  'ALGN',
  'AMT',
  'ANET',
  'AVGO',
  'AZO',
  'CMG',
  'COST',
  'CSCO',
  'CSX',
  'DHR',
  'DLTR',
  'EL',
  'HD',
  'HON',
  'INTU',
  'LOW',
  'LULU',
  'MA',
  'MCD',
  'MDLZ',
  'MO',
  'MRK',
  'MSFT',
  'NEE',
  'NSC',
  'PEP',
  'PG',
  'PYPL',
  'SBUX',
  'SPGI',
  'TGT',
  'TJX',
  'TMO',
  'ULTA',
  'V'],
 '2019-05-01': ['ACN',
  'ADBE',
  'ADSK',
  'ALGN',
  'AMZN',
  'AVGO',
  'AXP',
  'BLK',
  'BRK-B',
  'C',
  'CHTR',
  'CMCSA',
  'CSX',
  'DIS',
  'DLTR',
  'EL',
  'F',
  'HON',
  'JPM',
  'KO',
  'LMT',
  'LRCX',
  'LULU',
  'MA',
  'MAR',
  'MCD',
  'MDLZ',
  'META',
  'MS',
  'MSFT',
  'NOW',
  'NSC',
  'PEP',
  'PNC',
  'PYPL',
  'QCOM',
  'RTX',
  'SBUX',
  'SPGI',
  'STZ',
  'USB',
  'V'],
 '2019-06-01': ['ACN',
  'AMT',
  'BSX',
  'CHTR',
  'CNC',
  'DHR',
  'DIS',
  'ELV',
  'KO',
  'LMT',
  'MA',
  'MCD',
  'MDT',
  'NEE',
  'NOC',
  'PEP',
  'TGT',
  'TTWO'],
 '2019-07-01': ['ABT',
  'AMAT',
  'AMGN',
  'BDX',
  ...

* We will define a function which optimizes portfolio weights using the PyPortfolioOpt package and its EfficientFrontier optimizer to maximize the Sharpe ratio.

* To optimize the weights of a given portfolio we need to supply the last 1 year of prices to the function.

* Apply single-stock weight bounds as a diversification constraint (maximum 10% of the portfolio).

In [22]:
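def optimize_weights(prices, lower_bound):
    # Annualized mean historical returns and sample covariance from daily prices
    returns = expected_returns.mean_historical_return(prices=prices, frequency=252)
    cov = risk_models.sample_cov(prices=prices, frequency=252)
    # Each weight is bounded between lower_bound and 10% of the portfolio
    ef = EfficientFrontier(expected_returns=returns,
                           cov_matrix=cov,
                           weight_bounds=(lower_bound, .1),
                           solver='SCS')
    ef.max_sharpe()  # solve for the maximum Sharpe ratio portfolio
    return ef.clean_weights()  # round weights and zero out negligible ones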
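
For intuition about what the optimizer is solving, the unconstrained version of the max Sharpe problem has a closed form: weights proportional to the inverse covariance matrix times the excess returns. The NumPy sketch below ignores the per-stock weight bounds used above, so it is not a substitute for the PyPortfolioOpt call; the helper names, the `risk_free_rate` parameter, and the three-asset inputs are assumptions made for illustration.

```python
import numpy as np

def tangency_weights(mu, cov, risk_free_rate=0.0):
    """Unconstrained max Sharpe weights: proportional to inv(cov) @ (mu - rf)."""
    excess = np.asarray(mu, dtype=float) - risk_free_rate
    raw = np.linalg.solve(np.asarray(cov, dtype=float), excess)
    return raw / raw.sum()  # rescale so the weights sum to 1

def sharpe_ratio(w, mu, cov, risk_free_rate=0.0):
    return (w @ mu - risk_free_rate) / np.sqrt(w @ cov @ w)

# Invented annualized expected returns and covariance for three assets
mu = np.array([0.10, 0.15, 0.08])
cov = np.array([[0.0400, 0.0060, 0.0000],
                [0.0060, 0.0900, 0.0100],
                [0.0000, 0.0100, 0.0225]])

w_opt = tangency_weights(mu, cov)
w_eq = np.full(3, 1 / 3)
print(w_opt.round(3))
print(sharpe_ratio(w_opt, mu, cov), sharpe_ratio(w_eq, mu, cov))
```

Adding bounds such as a 10% cap removes the closed form, which is why the notebook hands the problem to a numerical solver. Note that a 10% cap can only be satisfied when the month's cluster contains at least 10 stocks.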

In [23]:
stocks = data.index.get_level_values('ticker').unique().tolist()
end_date = data.index.get_level_values('date').unique()[-1]
start_date = data.index.get_level_values('date').unique()[0] - pd.DateOffset(months=12)
data = yf.download(tickers=stocks, start=start_date, end=end_date).stack()
data

[*********************100%***********************]  173 of 173 completed


Date,Ticker,Close,High,Low,Open,Volume
2018-04-02,AAPL,39.328270,39.861521,38.806821,39.318834,150347200.0
2018-04-02,ABBV,67.086166,69.418706,66.235306,68.949264,7816500.0
2018-04-02,ABT,51.510921,53.174563,50.997672,52.935636,6686500.0
2018-04-02,ACN,133.377853,136.838946,132.043898,136.838946,2845800.0
2018-04-02,ADBE,212.279999,216.500000,207.220001,214.809998,3494900.0
...,...,...,...,...,...,...
2025-03-24,WFC,74.279999,74.529999,73.339996,73.639999,10685800.0
2025-03-24,WMT,87.489998,87.650002,86.349998,86.470001,17891100.0
2025-03-24,WYNN,84.870003,85.309998,83.620003,83.849998,2447500.0
2025-03-24,XOM,115.800003,116.910004,115.580002,115.680000,14201300.0


* Calculate daily returns for each stock which could end up in our portfolio.

* Then loop over each month start, select the stocks for the month and calculate their weights for the next month.

* If the maximum Sharpe ratio optimization fails for a given month, fall back to equal weights.

* Calculate the portfolio return for each day.

In [29]:
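# NOTE: `data` is in long (Date, Ticker) format, so pct_change() differences consecutive
# rows across tickers; per-ticker returns would need data['Close'].groupby(level='Ticker').pct_change().
returns_df = data['Close'].pct_change()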
portfolio_df = pd.DataFrame()
for start_date in portfolio.keys():
    try:
        end_date = pd.to_datetime(start_date) + pd.offsets.MonthEnd(0)
        cols = portfolio[start_date]
        opt_start_date = pd.to_datetime(start_date) - pd.DateOffset(months=12)
        opt_end_date = pd.to_datetime(start_date) - pd.DateOffset(days=1)
        opt_df = pd.DataFrame()
        opt_df = data[opt_start_date:opt_end_date]['Close'].unstack()[cols]
        success = False
        try:
            weights = optimize_weights(prices=opt_df, lower_bound=round(1/(len(opt_df.columns)*2),3))
            weights = pd.DataFrame(weights, index=pd.Series(0))
            weights = weights.stack().to_frame('weight').reset_index(level=0, drop=True)
            success = True
        except Exception:
            print(f'Max Sharpe Optimization failed for {start_date}, continuing with Equal-Weights')
        if not success:
            weights = pd.DataFrame([1/len(opt_df.columns) for i in range(len(opt_df.columns))],
                                     index=opt_df.columns.tolist(),
                                     columns=pd.Series(0)).T
            weights = weights.stack().to_frame('weight').reset_index(level=0, drop=True)
            
        temp_df = returns_df[start_date:end_date].to_frame('returns').div(100)
        temp_df = temp_df.reset_index(level=1)
        temp_df = temp_df.loc[temp_df["Ticker"].isin(cols)]
        temp_df = temp_df.merge(weights, left_on="Ticker", right_index=True, how="left")
        temp_df["weighted_return"] = temp_df["returns"] * temp_df["weight"]
        temp_df = temp_df.reset_index().set_index(['Date', 'Ticker'])
        temp_df = temp_df.groupby(level=0)['weighted_return'].sum().to_frame('Strategy Return')
        portfolio_df = pd.concat([portfolio_df, temp_df], axis=0)
    except Exception as e:
        print(e)

portfolio_df = portfolio_df.drop_duplicates()
portfolio_df

Max Sharpe Optimization failed for 2020-03-01, continuing with Equal-Weights
Max Sharpe Optimization failed for 2020-04-01, continuing with Equal-Weights
Max Sharpe Optimization failed for 2021-02-01, continuing with Equal-Weights
Max Sharpe Optimization failed for 2021-10-01, continuing with Equal-Weights
Max Sharpe Optimization failed for 2022-05-01, continuing with Equal-Weights
Max Sharpe Optimization failed for 2022-09-01, continuing with Equal-Weights
Max Sharpe Optimization failed for 2022-10-01, continuing with Equal-Weights
Max Sharpe Optimization failed for 2023-11-01, continuing with Equal-Weights


Date,Strategy Return
2019-04-01,0.021746
2019-04-02,0.021870
2019-04-03,0.021756
2019-04-04,0.021727
2019-04-05,0.021508
...,...
2025-03-18,0.007657
2025-03-19,0.007536
2025-03-20,0.007546
2025-03-21,0.007600
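
The weighting step can be checked by hand on a small example. The sketch below assumes a wide price table (one column per ticker), so that `pct_change()` runs down each ticker's own history; with the long (Date, Ticker) layout the equivalent would be a group-wise `pct_change()` per ticker. The helper `portfolio_returns`, the tickers and the prices are illustrative assumptions.

```python
import pandas as pd

# Invented closing prices, one column per ticker
prices = pd.DataFrame({'AAA': [100.0, 101.0, 102.01],
                       'BBB': [50.0, 49.0, 49.98]},
                      index=pd.to_datetime(['2024-01-02', '2024-01-03', '2024-01-04']))
daily = prices.pct_change().dropna()  # simple daily return per ticker

def portfolio_returns(daily_returns, weights=None):
    """Weighted sum of per-ticker returns; equal weights when none are given."""
    if weights is None:
        # Fallback used when the optimization does not produce weights
        weights = pd.Series(1 / daily_returns.shape[1], index=daily_returns.columns)
    weights = pd.Series(weights).reindex(daily_returns.columns)
    return daily_returns.mul(weights, axis=1).sum(axis=1).rename('Strategy Return')

optimized = portfolio_returns(daily, {'AAA': 0.6, 'BBB': 0.4})
equal = portfolio_returns(daily)
print(optimized)
print(equal)
```

Here AAA gains 1% on both days while BBB loses 2% and then gains 2%, so the 60/40 portfolio returns -0.2% and +1.4%, and the equal-weight portfolio returns -0.5% and +1.5%.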


In [25]:
total = np.sum(portfolio_df['Strategy Return'])
total

27.120810540229257

In [26]:
np.min(portfolio_df['Strategy Return'])

-0.004793931606988619

In [27]:
np.max(portfolio_df['Strategy Return'])

0.2100792889020078

In [28]:
np.std(portfolio_df['Strategy Return'])

0.017314099004128607
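
The cells above summarize the strategy with a plain sum, minimum, maximum and standard deviation of daily returns. As a possible complement, here is a minimal sketch of a few compounding-aware statistics, assuming the input is a series of simple daily returns and a zero risk-free rate. The function `summarize` and the three sample returns are assumptions for illustration, not results from this notebook.

```python
import numpy as np
import pandas as pd

def summarize(daily_returns, periods_per_year=252):
    r = pd.Series(daily_returns, dtype=float).dropna()
    wealth = (1 + r).cumprod()                   # growth of 1 unit invested
    return {
        'cumulative_return': wealth.iloc[-1] - 1,  # compounded, not summed
        'annualized_vol': r.std() * np.sqrt(periods_per_year),
        'annualized_sharpe': r.mean() / r.std() * np.sqrt(periods_per_year),
        'max_drawdown': (wealth / wealth.cummax() - 1).min(),
    }

sample = [0.10, -0.50, 0.20]
stats = summarize(sample)
print(sum(sample))                 # simple sum of the returns
print(stats['cumulative_return'])  # compounded return over the same days
print(stats['max_drawdown'])
```

In this sample the returns sum to -20%, while the compounded result is 1.1 × 0.5 × 1.2 − 1 = -34%, which is why cumulative performance is usually reported from the compounded series.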