## Unsupervised Learning Strategy

* Download/load S&P 500 stock price data
* Aggregate to monthly level and filter the top 150 most liquid stocks
* Calculate monthly returns for different time horizons
* Download Fama-French factors and calculate rolling factor betas
* For each month, fit a k-means clustering algorithm to group similar assets based on their features
* For each month, select assets based on the cluster and form a portfolio using efficient-frontier maximum Sharpe ratio optimization
* Visualize portfolio returns and compare them to S&P 500 returns

In [69]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.regression.rolling import RollingOLS
import yfinance as yf
import pandas_datareader.data as web
import statsmodels.api as sm
import datetime as dt
import pandas_ta
import warnings
warnings.filterwarnings("ignore")

### 1. Download S&P 500 stock data

In [70]:
sp500 = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]
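# Yahoo Finance uses '-' where the index listing uses '.', e.g. BRK.B -> BRK-B.
# regex=False treats '.' as a literal dot rather than a regex wildcard.
sp500['Symbol'] = sp500['Symbol'].str.replace('.', '-', regex=False)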
symbols = list(sp500['Symbol'].unique())
symbols

['MMM',
 'AOS',
 'ABT',
 'ABBV',
 'ACN',
 'ADBE',
 'AMD',
 'AES',
 'AFL',
 'A',
 'APD',
 'ABNB',
 'AKAM',
 'ALB',
 'ARE',
 'ALGN',
 'ALLE',
 'LNT',
 'ALL',
 'GOOGL',
 'GOOG',
 'MO',
 'AMZN',
 'AMCR',
 'AEE',
 'AAL',
 'AEP',
 'AXP',
 'AIG',
 'AMT',
 'AWK',
 'AMP',
 'AME',
 'AMGN',
 'APH',
 'ADI',
 'ANSS',
 'AON',
 'APA',
 'AAPL',
 'AMAT',
 'APTV',
 'ACGL',
 'ADM',
 'ANET',
 'AJG',
 'AIZ',
 'T',
 'ATO',
 'ADSK',
 'ADP',
 'AZO',
 'AVB',
 'AVY',
 'AXON',
 'BKR',
 'BALL',
 'BAC',
 'BK',
 'BBWI',
 'BAX',
 'BDX',
 'BRK-B',
 'BBY',
 'BIO',
 'TECH',
 'BIIB',
 'BLK',
 'BX',
 'BA',
 'BKNG',
 'BWA',
 'BSX',
 'BMY',
 'AVGO',
 'BR',
 'BRO',
 'BF-B',
 'BLDR',
 'BG',
 'BXP',
 'CHRW',
 'CDNS',
 'CZR',
 'CPT',
 'CPB',
 'COF',
 'CAH',
 'KMX',
 'CCL',
 'CARR',
 'CTLT',
 'CAT',
 'CBOE',
 'CBRE',
 'CDW',
 'CE',
 'COR',
 'CNC',
 'CNP',
 'CF',
 'CRL',
 'SCHW',
 'CHTR',
 'CVX',
 'CMG',
 'CB',
 'CHD',
 'CI',
 'CINF',
 'CTAS',
 'CSCO',
 'C',
 'CFG',
 'CLX',
 'CME',
 'CMS',
 'KO',
 'CTSH',
 'CL',
 'CMCSA',
 'CAG'

In [71]:
end_date = '2023-09-27'
start_date = pd.to_datetime(end_date) - pd.DateOffset(years=5)

df = yf.download(tickers=symbols,
                 start=start_date, 
                 end=end_date).stack()

df

[*********************100%%**********************]  503 of 503 completed

4 Failed downloads:
['VLTO', 'SW', 'SOLV', 'GEV']: YFChartError("%ticker%: Data doesn't exist for startDate = 1538020800, endDate = 1695787200")


Date,Ticker,Adj Close,Close,High,Low,Open,Volume
2018-09-27,A,67.763641,70.800003,70.849998,70.099998,70.580002,1581700.0
2018-09-27,AAL,40.727219,41.500000,42.200001,41.150002,41.230000,5654600.0
2018-09-27,AAPL,53.586330,56.237499,56.610001,55.884998,55.955002,120724800.0
2018-09-27,ABBV,71.712502,94.139999,94.889999,93.959999,94.349998,3028600.0
2018-09-27,ABT,65.870628,73.019997,73.180000,72.690002,73.019997,5493900.0
...,...,...,...,...,...,...,...
2023-09-26,XYL,88.501091,89.519997,90.849998,89.500000,90.379997,1322400.0
2023-09-26,YUM,121.604256,124.010002,124.739998,123.449997,124.239998,1500600.0
2023-09-26,ZBH,111.534821,112.459999,117.110001,112.419998,116.769997,3610500.0
2023-09-26,ZBRA,223.960007,223.960007,226.649994,222.580002,225.970001,355400.0


In [72]:
df.columns = df.columns.str.lower()
df

Date,Ticker,adj close,close,high,low,open,volume
2018-09-27,A,67.763641,70.800003,70.849998,70.099998,70.580002,1581700.0
2018-09-27,AAL,40.727219,41.500000,42.200001,41.150002,41.230000,5654600.0
2018-09-27,AAPL,53.586330,56.237499,56.610001,55.884998,55.955002,120724800.0
2018-09-27,ABBV,71.712502,94.139999,94.889999,93.959999,94.349998,3028600.0
2018-09-27,ABT,65.870628,73.019997,73.180000,72.690002,73.019997,5493900.0
...,...,...,...,...,...,...,...
2023-09-26,XYL,88.501091,89.519997,90.849998,89.500000,90.379997,1322400.0
2023-09-26,YUM,121.604256,124.010002,124.739998,123.449997,124.239998,1500600.0
2023-09-26,ZBH,111.534821,112.459999,117.110001,112.419998,116.769997,3610500.0
2023-09-26,ZBRA,223.960007,223.960007,226.649994,222.580002,225.970001,355400.0


### 2. Calculate features and technical indicators for each stock

* **Garman-Klass Volatility** - *Volatility metric that includes open, close, high, and low*
* **RSI** (Relative Strength Index) - *momentum oscillator used in technical analysis to measure the speed and change of price movements*
* **Bollinger Bands** - *measure of volatility*: the middle band is a simple moving average, and the upper and lower bands sit 2 standard deviations above and below it over the same period
* **ATR** (Average True Range) - *measures market volatility by calculating the average range of price movement over a specified period*
* **MACD** (Moving Average Convergence Divergence) - *momentum indicator used in technical analysis to identify trends, momentum, and potential reversal points in the price of a financial asset*
* **Dollar Volume** - *measure of the total monetary value of shares traded for a particular security or in a market over a specific period*
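
As a rough illustration of the first indicator, the sketch below evaluates the Garman-Klass estimator on a few made-up bars. The helper name `garman_klass_variance` and the toy prices are assumptions for illustration and are not part of the notebook. It uses the raw close, for which the estimate stays non-negative as long as open and close lie inside the high-low range.

```python
import numpy as np
import pandas as pd


def garman_klass_variance(bars: pd.DataFrame) -> pd.Series:
    """Per-bar Garman-Klass variance estimate from open/high/low/close columns."""
    log_hl = np.log(bars["high"]) - np.log(bars["low"])
    log_co = np.log(bars["close"]) - np.log(bars["open"])
    return 0.5 * log_hl**2 - (2 * np.log(2) - 1) * log_co**2


# Made-up daily bars, purely for illustration.
toy_bars = pd.DataFrame(
    {
        "open": [100.0, 101.0, 99.5],
        "high": [102.0, 103.5, 101.0],
        "low": [99.0, 100.5, 98.0],
        "close": [101.0, 100.8, 100.5],
    }
)

gk = garman_klass_variance(toy_bars)
print(gk)
```

Wide high-low ranges push the estimate up, while a large open-to-close move pulls it down.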

In [73]:
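# Calculate the Garman-Klass volatility.
# Note: this compares 'adj close' with the unadjusted 'open', 'high' and 'low',
# which is why some values in the output further down are negative.
df['garman_klass_vol'] = ((np.log(df['high'])-np.log(df['low']))**2)/2-(2*np.log(2)-1)*(np.log(df['adj close'])-np.log(df['open']))**2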

In [74]:
# calculate the rsi
df['rsi'] = df.groupby(level=1)['adj close'].transform(lambda x: pandas_ta.rsi(close=x, length=20))

In [75]:
# calculate the bollinger bands
df['bb_low'] = df.groupby(level=1)['adj close'].transform(lambda x: pandas_ta.bbands(close=np.log1p(x), length=20).iloc[:,0])
df['bb_mid'] = df.groupby(level=1)['adj close'].transform(lambda x: pandas_ta.bbands(close=np.log1p(x), length=20).iloc[:,1])
df['bb_high'] = df.groupby(level=1)['adj close'].transform(lambda x: pandas_ta.bbands(close=np.log1p(x), length=20).iloc[:,2])


In [76]:
# calculate the ATR
def compute_atr(stock_data):
    atr = pandas_ta.atr(high=stock_data['high'],
                        low=stock_data['low'], 
                        close=stock_data['close'], 
                        length=14)
    return atr.sub(atr.mean()).div(atr.std())

df['atr'] = df.groupby(level=1, group_keys=False).apply(compute_atr)

In [77]:
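# calculate the MACD line, z-scored per ticker
def compute_macd(close):
    # pandas_ta.macd is parameterised by fast/slow/signal rather than length
    macd = pandas_ta.macd(close=close, length = 20).iloc[:,0]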
    return macd.sub(macd.mean()).div(macd.std())

df['macd'] = df.groupby(level=1, group_keys=False)['adj close'].apply(compute_macd)

In [78]:
# calculate the dollar volume
df['dollar_volume'] = df['adj close']*df['volume']/1e6
df

Date,Ticker,adj close,close,high,low,open,volume,garman_klass_vol,rsi,bb_low,bb_mid,bb_high,atr,macd,dollar_volume
2018-09-27,A,67.763641,70.800003,70.849998,70.099998,70.580002,1581700.0,-0.000584,,,,,,,107.181752
2018-09-27,AAL,40.727219,41.500000,42.200001,41.150002,41.230000,5654600.0,0.000259,,,,,,,230.296130
2018-09-27,AAPL,53.586330,56.237499,56.610001,55.884998,55.955002,120724800.0,-0.000640,,,,,,,6469.199022
2018-09-27,ABBV,71.712502,94.139999,94.889999,93.959999,94.349998,3028600.0,-0.029026,,,,,,,217.188482
2018-09-27,ABT,65.870628,73.019997,73.180000,72.690002,73.019997,5493900.0,-0.004079,,,,,,,361.886645
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-09-26,XYL,88.501091,89.519997,90.849998,89.500000,90.379997,1322400.0,-0.000058,26.146725,4.483137,4.565058,4.646979,-0.701611,-1.727457,117.033843
2023-09-26,YUM,121.604256,124.010002,124.739998,123.449997,124.239998,1500600.0,-0.000124,36.057218,4.806770,4.836734,4.866698,-0.394780,-1.104325,182.479346
2023-09-26,ZBH,111.534821,112.459999,117.110001,112.419998,116.769997,3610500.0,0.000022,31.893225,4.745884,4.785551,4.825217,-0.779533,-0.733478,402.696470
2023-09-26,ZBRA,223.960007,223.960007,226.649994,222.580002,225.970001,355400.0,0.000133,29.494977,5.400991,5.539167,5.677342,-0.889014,-1.269833,79.595386


### 3. Aggregate to monthly level and filter top 150 most liquid stocks for each month

* To reduce training time and experiment with features and strategies, we convert the business-daily data to month-end frequency
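
A minimal sketch of this aggregation on made-up data for a single ticker is shown below. The notebook itself uses `resample('M')`; the sketch groups by `to_period("M")` instead, which is an assumption made here to keep the example independent of the resample frequency alias, so the result is indexed by monthly period rather than month-end timestamp. The column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up business-daily data for one ticker over two months.
days = pd.bdate_range("2023-01-02", "2023-02-28")
daily = pd.DataFrame(
    {
        "adj_close": np.linspace(100.0, 110.0, len(days)),
        "dollar_volume": np.linspace(50.0, 70.0, len(days)),
    },
    index=days,
)

# Group by calendar month: average the liquidity measure and keep the
# last observed value of the price-based column.
monthly = daily.groupby(daily.index.to_period("M")).agg(
    {"dollar_volume": "mean", "adj_close": "last"}
)
print(monthly)
```

Averaging dollar volume smooths out single-day spikes, while taking the last value of the other columns keeps each feature as it stood at month end.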

In [79]:
last_cols = [c for c in df.columns.unique() if c not in ['dollar_volume', 'volume', 'open', 'high', 'low', 'close']]


data = (pd.concat([df.unstack('Ticker')['dollar_volume'].resample('M').mean().stack('Ticker').to_frame('dollar_volume'),
          df.unstack()[last_cols].resample('M').last().stack('Ticker')],
            axis=1)).dropna()

data

Date,Ticker,dollar_volume,adj close,garman_klass_vol,rsi,bb_low,bb_mid,bb_high,atr,macd
2018-11-30,A,190.565783,69.393738,-0.000223,63.500009,4.102268,4.180028,4.257788,-0.954636,0.399540
2018-11-30,AAL,317.481135,39.520954,0.000717,62.285730,3.561579,3.626777,3.691974,1.609222,1.462637
2018-11-30,AAPL,8402.804690,42.688766,-0.000970,35.364654,3.707130,3.833351,3.959571,-1.029003,-1.277233
2018-11-30,ABBV,518.552132,72.579720,-0.017217,61.190725,4.154765,4.226265,4.297765,0.374620,-0.027624
2018-11-30,ABT,409.388037,67.074455,-0.003534,60.383999,4.129261,4.176752,4.224242,-0.719563,0.147162
...,...,...,...,...,...,...,...,...,...,...
2023-09-30,OTIS,153.715885,78.028641,-0.000190,33.116174,4.365997,4.411282,4.456568,-1.028320,-1.534536
2023-09-30,ABNB,1633.500725,132.279999,0.000213,44.494127,4.857047,4.940924,5.024801,-1.006939,-0.037854
2023-09-30,CEG,196.304723,107.661484,0.000080,55.245457,4.650304,4.690476,4.730649,-0.436215,0.366876
2023-09-30,GEHC,212.197215,66.105721,0.000185,40.922327,4.155071,4.212607,4.270142,-0.893478,-1.116463


* Calculate 5-year rolling average of dollar volume for each stock before filtering
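
The smoothing and ranking step can be sketched on a small wide table as follows. The tickers, the random values, and the shortened window (3 months, at least 2 observations, top 2 names) are all assumptions chosen to keep the example small; the notebook uses a 60-month window with a 12-month minimum.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
months = pd.period_range("2022-01", periods=6, freq="M")
tickers = ["AAA", "BBB", "CCC", "DDD"]  # hypothetical tickers

# Wide table: one row per month, one column per ticker.
dollar_volume = pd.DataFrame(
    rng.uniform(10, 100, size=(len(months), len(tickers))),
    index=months,
    columns=tickers,
)

# Rolling mean over up to 3 months, requiring at least 2 observations.
smoothed = dollar_volume.rolling(window=3, min_periods=2).mean()

# Rank tickers within each month (1 = most liquid) and flag the top 2.
ranks = smoothed.rank(axis=1, ascending=False)
top_n = 2
selected = ranks <= top_n
print(selected)
```

Months with too little history produce `NaN` averages and therefore select nothing, which mirrors why the filtered notebook data starts later than the raw monthly data.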

In [80]:
data['dollar_volume'] = (data.loc[:, 'dollar_volume'].unstack('Ticker').rolling(5*12, min_periods=12).mean().stack())

data['dollar_vol_rank'] = data.groupby('Date')['dollar_volume'].rank(ascending=False)

# Note: `< 150` keeps ranks 1 to 149; use `<= 150` to keep exactly 150 stocks per month.
data = data[data['dollar_vol_rank']<150].drop(['dollar_vol_rank', 'dollar_volume'], axis=1)

data

Date,Ticker,adj close,garman_klass_vol,rsi,bb_low,bb_mid,bb_high,atr,macd
2019-10-31,AAL,29.852556,0.000296,58.952287,3.263377,3.372292,3.481206,-0.005208,1.301410
2019-10-31,AAPL,60.177776,0.000923,68.908191,4.006348,4.067200,4.128052,-1.328535,0.371075
2019-10-31,ABBV,64.707718,-0.016716,70.480830,4.076652,4.136870,4.197088,-1.377833,0.733365
2019-10-31,ABT,76.989975,-0.002606,54.403463,4.293377,4.330068,4.366760,-0.836382,0.044183
2019-10-31,ACN,172.986679,-0.001881,47.064280,5.136871,5.156541,5.176211,-0.972737,-0.452549
...,...,...,...,...,...,...,...,...,...
2023-09-30,XOM,112.466652,-0.000205,59.440186,4.679146,4.719239,4.759332,0.236421,1.124268
2023-09-30,MRNA,98.120003,0.000146,38.747314,4.582514,4.685332,4.788149,-0.529511,-0.376899
2023-09-30,UBER,44.270000,0.000441,45.005268,3.806654,3.862227,3.917801,-0.746098,-0.133973
2023-09-30,CRWD,160.479996,0.000144,51.534803,5.026187,5.103696,5.181204,-0.744862,0.245950


### 4. Calculate monthly returns for different time horizons as features

* To capture time-series dynamics such as momentum patterns, we compute historical returns with `.pct_change(lag)`, i.e. returns over several monthly horizons given by the lags, clip outliers, and convert each multi-month return to an equivalent per-month rate
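
To make the per-month conversion concrete, here is a small sketch on made-up prices that grow 10% each month. The variable names are illustrative, and the quantile clipping used in the notebook is left out for brevity.

```python
import pandas as pd

# Made-up month-end prices growing 10% per month.
prices = pd.Series([100.0, 110.0, 121.0, 133.1])

lag = 2
# Simple return over `lag` months.
cumulative = prices.pct_change(lag)
# Geometric per-month rate that compounds to the same `lag`-month return.
monthly_equiv = (1 + cumulative) ** (1 / lag) - 1
print(monthly_equiv)
```

Expressing every horizon as a per-month rate keeps the 1-month and 12-month features on a comparable scale.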

In [81]:
def calculate_returns(df):

    outlier_cutoff = 0.005

    lags = [1, 2, 3, 6, 9, 12]

    for lag in lags:

        df[f'return_{lag}m'] = (df['adj close']
                                .pct_change(lag)
                                .pipe(lambda x: x.clip(lower = x.quantile(outlier_cutoff),
                                                      upper = x.quantile(1-outlier_cutoff)))
                                .add(1)
                                .pow(1/lag)
                                .sub(1))
        
    return df

data = data.groupby('Ticker', group_keys=False).apply(calculate_returns).dropna()

In [82]:
data

Date,Ticker,adj close,garman_klass_vol,rsi,bb_low,bb_mid,bb_high,atr,macd,return_1m,return_2m,return_3m,return_6m,return_9m,return_12m
2020-10-31,AAPL,106.382866,0.000040,43.770313,4.684913,4.743730,4.802548,0.973832,-0.574353,-0.060012,-0.081515,0.008696,0.068576,0.039452,0.048624
2020-10-31,ABBV,73.054329,-0.005843,47.517633,4.256774,4.300160,4.343547,-0.568271,-0.889501,-0.015197,-0.050954,-0.031356,0.010106,0.010004,0.010162
2020-10-31,ABT,98.313591,-0.001289,48.047166,4.595915,4.623247,4.650579,0.484444,-0.039209,-0.030963,-0.018485,0.015717,0.023513,0.022346,0.020583
2020-10-31,ACN,205.613724,-0.000628,42.280810,5.318201,5.367234,5.416267,-0.012704,-0.789110,-0.036420,-0.047322,-0.010521,0.027996,0.007530,0.014503
2020-10-31,ADBE,447.100006,0.000472,39.586497,6.119939,6.192112,6.264286,0.513221,-0.796865,-0.088351,-0.066792,0.002081,0.039858,0.027208,0.040413
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-09-30,XOM,112.466652,-0.000205,59.440186,4.679146,4.719239,4.759332,0.236421,1.124268,0.046947,0.046139,0.030496,0.012838,0.008747,0.027037
2023-09-30,MRNA,98.120003,0.000146,38.747314,4.582514,4.685332,4.788149,-0.529511,-0.376899,-0.132219,-0.086803,-0.068763,-0.071952,-0.064976,-0.015431
2023-09-30,UBER,44.270000,0.000441,45.005268,3.806654,3.862227,3.917801,-0.746098,-0.133973,-0.062672,-0.053920,0.008422,0.057244,0.066838,0.043691
2023-09-30,CRWD,160.479996,0.000144,51.534803,5.026187,5.103696,5.181204,-0.744862,0.245950,-0.015641,-0.003656,0.029981,0.026391,0.047942,-0.002216


### 5. Download Fama-French factors and calculate rolling factor betas

* We use the Fama-French data to estimate the exposure of assets to common risk factors using linear regression
* The five Fama-French factors, namely market risk, size, value, operating profitability, and investment, have been shown empirically to explain asset returns in the past
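
Written out, the regression estimated for each stock $i$ can be sketched as follows. The symbols are notation introduced here for illustration: $r_{i,t}$ denotes the stock's one-month return (`return_1m`), and the factor names match the column labels of the Fama-French dataset.

$$
r_{i,t} = \alpha_i
+ \beta_i^{\mathrm{MKT}}\,(\mathrm{Mkt\text{-}RF})_t
+ \beta_i^{\mathrm{SMB}}\,\mathrm{SMB}_t
+ \beta_i^{\mathrm{HML}}\,\mathrm{HML}_t
+ \beta_i^{\mathrm{RMW}}\,\mathrm{RMW}_t
+ \beta_i^{\mathrm{CMA}}\,\mathrm{CMA}_t
+ \varepsilon_{i,t}
$$

The five $\beta_i$ coefficients are the factor betas that later serve as features; the intercept $\alpha_i$ is estimated but discarded.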

In [83]:
factor_data = web.DataReader('F-F_Research_Data_5_Factors_2x3',
               'famafrench', 
               start='2010')[0].drop('RF', axis=1)

factor_data.index = factor_data.index.to_timestamp()

factor_data = factor_data.resample('M').last().div(100)

factor_data = factor_data.join(data['return_1m']).sort_index()

factor_data

Date,Ticker,Mkt-RF,SMB,HML,RMW,CMA,return_1m
2020-10-31,AAPL,-0.0210,0.0454,0.0431,-0.0076,-0.0088,-0.060012
2020-10-31,ABBV,-0.0210,0.0454,0.0431,-0.0076,-0.0088,-0.015197
2020-10-31,ABT,-0.0210,0.0454,0.0431,-0.0076,-0.0088,-0.030963
2020-10-31,ACN,-0.0210,0.0454,0.0431,-0.0076,-0.0088,-0.036420
2020-10-31,ADBE,-0.0210,0.0454,0.0431,-0.0076,-0.0088,-0.088351
...,...,...,...,...,...,...,...
2023-09-30,VRTX,-0.0524,-0.0179,0.0145,0.0185,-0.0084,0.009617
2023-09-30,VZ,-0.0524,-0.0179,0.0145,0.0185,-0.0084,-0.056890
2023-09-30,WFC,-0.0524,-0.0179,0.0145,0.0185,-0.0084,-0.015500
2023-09-30,WMT,-0.0524,-0.0179,0.0145,0.0185,-0.0084,-0.000676


In [84]:
observations = factor_data.groupby(level=1).size()

valid_stocks = observations[observations>20]

factor_data = factor_data[factor_data.index.get_level_values('Ticker').isin(valid_stocks.index)]

factor_data

Date,Ticker,Mkt-RF,SMB,HML,RMW,CMA,return_1m
2020-10-31,AAPL,-0.0210,0.0454,0.0431,-0.0076,-0.0088,-0.060012
2020-10-31,ABBV,-0.0210,0.0454,0.0431,-0.0076,-0.0088,-0.015197
2020-10-31,ABT,-0.0210,0.0454,0.0431,-0.0076,-0.0088,-0.030963
2020-10-31,ACN,-0.0210,0.0454,0.0431,-0.0076,-0.0088,-0.036420
2020-10-31,ADBE,-0.0210,0.0454,0.0431,-0.0076,-0.0088,-0.088351
...,...,...,...,...,...,...,...
2023-09-30,VRTX,-0.0524,-0.0179,0.0145,0.0185,-0.0084,0.009617
2023-09-30,VZ,-0.0524,-0.0179,0.0145,0.0185,-0.0084,-0.056890
2023-09-30,WFC,-0.0524,-0.0179,0.0145,0.0185,-0.0084,-0.015500
2023-09-30,WMT,-0.0524,-0.0179,0.0145,0.0185,-0.0084,-0.000676


Calculate rolling factor betas
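
The idea of a rolling regression can be sketched with NumPy alone on synthetic data. This is not the `RollingOLS` implementation used in the notebook and it ignores options such as `min_nobs`; the two factors, the true betas, and the noise levels are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n_months, window = 36, 24

# Synthetic returns for 2 hypothetical factors, and a stock return
# generated from known betas of 1.2 and -0.5 plus small noise.
factors = rng.normal(0.0, 0.04, size=(n_months, 2))
true_betas = np.array([1.2, -0.5])
stock = 0.001 + factors @ true_betas + rng.normal(0.0, 0.001, size=n_months)

# Design matrix with an intercept column, similar to what sm.add_constant adds.
X = np.column_stack([np.ones(n_months), factors])

# One OLS fit per trailing window; rows before the first full window stay NaN.
rolling_betas = np.full((n_months, 2), np.nan)
for end in range(window, n_months + 1):
    coef, *_ = np.linalg.lstsq(X[end - window:end], stock[end - window:end], rcond=None)
    rolling_betas[end - 1] = coef[1:]  # drop the intercept

print(rolling_betas[-1])
```

Each row holds the betas estimated from the 24 months ending at that row, which is why the first rows of the notebook's `betas` output are empty.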

In [85]:
betas = (factor_data.groupby(level=1,
                            group_keys=False)
         .apply(lambda x: RollingOLS(endog=x['return_1m'], 
                                     exog=sm.add_constant(x.drop('return_1m', axis=1)),
                                     window=min(24, x.shape[0]),
                                     min_nobs=len(x.columns)+1)
         .fit(params_only=True)
         .params
         .drop('const', axis=1)))

betas

Date,Ticker,Mkt-RF,SMB,HML,RMW,CMA
2020-10-31,AAPL,,,,,
2020-10-31,ABBV,,,,,
2020-10-31,ABT,,,,,
2020-10-31,ACN,,,,,
2020-10-31,ADBE,,,,,
...,...,...,...,...,...,...
2023-09-30,VRTX,0.456767,-0.438857,-0.316524,-0.077672,0.802295
2023-09-30,VZ,0.328559,-0.161724,0.265469,0.318782,0.102685
2023-09-30,WFC,1.112946,0.306860,2.044026,-0.451834,-1.511410
2023-09-30,WMT,0.705092,-0.322083,-0.373224,-0.156547,0.485160


In [86]:
betas.groupby('Ticker').shift()

Date,Ticker,Mkt-RF,SMB,HML,RMW,CMA
2020-10-31,AAPL,,,,,
2020-10-31,ABBV,,,,,
2020-10-31,ABT,,,,,
2020-10-31,ACN,,,,,
2020-10-31,ADBE,,,,,
...,...,...,...,...,...,...
2023-09-30,VRTX,0.505749,-0.402539,-0.485809,0.059479,0.948558
2023-09-30,VZ,0.298098,-0.206684,0.340681,0.325633,-0.001676
2023-09-30,WFC,1.130269,0.322854,1.988390,-0.416301,-1.457798
2023-09-30,WMT,0.744735,-0.270973,-0.481053,-0.134976,0.616545


In [89]:
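factors = ['Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA']

# Join the betas lagged by one month, so each row only uses betas estimated up to the previous month
data = data.join(betas.groupby('Ticker').shift())

# Fill missing betas with each ticker's own average
data.loc[:, factors] = data.groupby('Ticker', group_keys=False)[factors].apply(lambda x: x.fillna(x.mean()))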

data = data.drop('adj close', axis=1)

data = data.dropna()

data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4973 entries, (Timestamp('2020-10-31 00:00:00'), 'AAPL') to (Timestamp('2023-09-30 00:00:00'), 'CRWD')
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   garman_klass_vol  4973 non-null   float64
 1   rsi               4973 non-null   float64
 2   bb_low            4973 non-null   float64
 3   bb_mid            4973 non-null   float64
 4   bb_high           4973 non-null   float64
 5   atr               4973 non-null   float64
 6   macd              4973 non-null   float64
 7   return_1m         4973 non-null   float64
 8   return_2m         4973 non-null   float64
 9   return_3m         4973 non-null   float64
 10  return_6m         4973 non-null   float64
 11  return_9m         4973 non-null   float64
 12  return_12m        4973 non-null   float64
 13  Mkt-RF            4973 non-null   float64
 14  SMB               4973 non-null   float64
 15  HML       

In [90]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

data


Date,Ticker,garman_klass_vol,rsi,bb_low,bb_mid,bb_high,atr,macd,return_1m,return_2m,return_3m,return_6m,return_9m,return_12m,Mkt-RF,SMB,HML,RMW,CMA
2020-10-31,AAPL,0.000040,43.770313,4.684913,4.743730,4.802548,0.973832,-0.574353,-0.060012,-0.081515,0.008696,0.068576,0.039452,0.048624,1.166758,-0.298913,-0.455019,0.224538,0.126958
2020-10-31,ABBV,-0.005843,47.517633,4.256774,4.300160,4.343547,-0.568271,-0.889501,-0.015197,-0.050954,-0.031356,0.010106,0.010004,0.010162,0.504207,-0.163212,-0.387343,0.456908,0.898047
2020-10-31,ABT,-0.001289,48.047166,4.595915,4.623247,4.650579,0.484444,-0.039209,-0.030963,-0.018485,0.015717,0.023513,0.022346,0.020583,0.675211,0.055411,-0.336906,0.524448,0.632236
2020-10-31,ACN,-0.000628,42.280810,5.318201,5.367234,5.416267,-0.012704,-0.789110,-0.036420,-0.047322,-0.010521,0.027996,0.007530,0.014503,1.292610,-0.282207,-0.278870,0.541214,0.013725
2020-10-31,ADBE,0.000472,39.586497,6.119939,6.192112,6.264286,0.513221,-0.796865,-0.088351,-0.066792,0.002081,0.039858,0.027208,0.040413,1.571286,-0.851378,-0.029738,0.123233,-0.253627
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-09-30,WMT,-0.000074,54.722574,3.982183,3.999651,4.017120,-0.706496,0.318756,-0.000676,0.010014,0.012354,0.017574,0.016553,0.020256,0.744735,-0.270973,-0.481053,-0.134976,0.616545
2023-09-30,XOM,-0.000205,59.440186,4.679146,4.719239,4.759332,0.236421,1.124268,0.046947,0.046139,0.030496,0.012838,0.008747,0.027037,1.002500,-1.028811,1.690931,-0.665030,-0.274984
2023-09-30,MRNA,0.000146,38.747314,4.582514,4.685332,4.788149,-0.529511,-0.376899,-0.132219,-0.086803,-0.068763,-0.071952,-0.064976,-0.015431,1.158954,0.746945,-1.134382,0.610629,0.970695
2023-09-30,UBER,0.000441,45.005268,3.806654,3.862227,3.917801,-0.746098,-0.133973,-0.062672,-0.053920,0.008422,0.057244,0.066838,0.043691,1.090266,1.140600,-0.263071,-1.534224,-0.434869
