This notebook experiments with the following "would you rather" question:

Would you rather:
1. Have a distribution fit that is mostly accurate on typical days, but does not predict high-return days with strong accuracy?
2. Have a distribution fit that is less accurate on typical days, but predicts high-return days to a high degree of accuracy?

Results:

Over the last 23 years of SPY data, assuming trades only take place on a day-to-day basis and only positive-return days are held, removing the top 0.112 percent of days cuts the accumulated return to about 54% of the optimum. No unit conversion needed: 0.112 percent really is about 7 trading days, so roughly 46% of the positive return accumulation in SPY over 23 years is made on 7 days?!
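
The "7 days" figure can be checked with quick arithmetic. The sketch below assumes roughly 252 trading days per year, which is a common rule of thumb and not a number taken from the downloaded data, so the result is only approximate.

```python
# Assumption: about 252 trading days per year (rule of thumb, not read from the data).
TRADING_DAYS_PER_YEAR = 252
YEARS = 23
PERCENT_OF_DAYS = 0.112  # already in percent, so divide by 100 below

total_days = TRADING_DAYS_PER_YEAR * YEARS
top_days = total_days * PERCENT_OF_DAYS / 100

print(f"Approximate trading days in {YEARS} years: {total_days}")
print(f"Days in the top {PERCENT_OF_DAYS}%: {top_days:.2f}")
```

Under that assumption the top 0.112 percent works out to between 6 and 7 days, consistent with the 7 days quoted above.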

This does not discredit a day-to-day, Simons-style strategy, but intuitively it seems to be a massive point in favor of a buy-and-hold (B&H), Buffett-style strategy.

Question: Does this gap between the optimal return and the percentile-filtered return matter?
    The problem becomes: someone who leaves their money in for the whole period is 100% certain to capture the gains on these 7 days.

In [18]:
import pandas as pd
import yfinance as yf
from Scrapers.yf_scraper import YFScraper

In [19]:
scraper = YFScraper()
data = scraper.download_and_add_features('SPY', start='2000-01-01', end='2023-01-01')

[*********************100%%**********************]  1 of 1 completed


In [20]:
data.head()

Date,Open,High,Low,Close,Adj Close,Volume,daily_return,volume_change,STD_L=20,SMA_20,...,Bollinger_Upper_Band,Bollinger_Lower_Band,CMF_20,CCI_20,volume_oscillator,force_index,on_balance_volume,aroon_oscillator,volume_price_trend,ultimate_oscillator
2000-01-03,148.25,148.25,143.875,145.4375,93.591171,8164300,,,,,...,,,,,,,8164300,,,
2000-01-04,143.53125,144.0625,139.640625,139.75,89.931145,8089800,-0.039107,-0.009125,,,...,,,,,,,74500,,-316360.893855,
2000-01-05,139.9375,141.53125,137.25,140.0,90.09201,12177900,0.001789,0.50534,,,...,,,,,,,12252400,,-294575.741798,
2000-01-06,139.625,141.5,137.75,137.75,88.644112,6227200,-0.016071,-0.488647,,,...,,,,,,,6025200,,-394655.741798,
2000-01-07,140.3125,145.75,140.0625,145.75,93.792236,8066500,0.058076,0.295365,,,...,,,,,,,14091700,,73816.127531,


In [21]:
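def calculate_principal_with_data(data):
    """Compound a principal of 1 over positive-return days only.

    Returns the final principal and the number of days invested.
    """
    returns = data['daily_return']
    # Keep only winning days; NaN compares as False, so missing values are skipped
    positive_returns = returns[returns > 0]
    principal = (1 + positive_returns).prod()
    days_invested = len(positive_returns)
    return principal, days_invested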
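
To make explicit what this number measures: it is the return of a perfect-foresight strategy that is invested only on days that turn out positive, so it is an upper bound rather than a realistic result. A minimal sketch on made-up returns follows; the toy series and the helper name `compound_positive_days` are hypothetical and not part of the notebook.

```python
import pandas as pd

def compound_positive_days(returns: pd.Series):
    """Hypothetical helper: compound 1 unit over positive-return days only."""
    positive = returns[returns > 0]  # NaN > 0 is False, so NaN rows drop out too
    return float((1 + positive).prod()), int(len(positive))

# Made-up daily returns: one missing value, two winning days, two losing days
toy = pd.Series([float("nan"), 0.10, -0.05, 0.20, -0.01])
principal, days = compound_positive_days(toy)

print(f"principal={principal:.4f}, days_invested={days}")
```

Only the +10% and +20% days are compounded; the losing days and the missing first value are ignored.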

In [22]:
baseline_opt_return, days_invested = calculate_principal_with_data(data)
print("We are not interested in the absolute return itself, but in how much it changes when the top nth percent of "
      "best-performing days is removed. Therefore, let's make a function that filters out the top nth percentile of "
      "the data, then divide the filtered return by the baseline to get the share of the optimal return that remains.")

We are not interested in the absolute return itself, but in how much it changes when the top nth percent of best-performing days is removed. Therefore, let's make a function that filters out the top nth percentile of the data, then divide the filtered return by the baseline to get the share of the optimal return that remains.


In [23]:
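def filter_top_percentile(df, percentile):
    """Drop the top `percentile` percent of days by daily_return (0.112 means 0.112%)."""
    # Calculate the cutoff value for the specified percentile
    cutoff = df['daily_return'].quantile(1 - percentile / 100.0)

    # Filter the dataframe to exclude values above the cutoff
    # (rows with a NaN daily_return are also dropped, since NaN <= cutoff is False)
    filtered_df = df[df['daily_return'] <= cutoff]

    return filtered_df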
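
A quick way to see how the quantile cutoff behaves is to run the same filter on a small synthetic frame. The sketch below repeats the function so it runs on its own; the toy data (100 evenly spaced returns plus one missing value) is invented for illustration.

```python
import pandas as pd

def filter_top_percentile(df, percentile):
    # Same logic as the notebook function: keep rows at or below the cutoff
    cutoff = df['daily_return'].quantile(1 - percentile / 100.0)
    return df[df['daily_return'] <= cutoff]

# Invented data: one NaN row followed by returns 0.001, 0.002, ..., 0.100
toy = pd.DataFrame({"daily_return": [float("nan")] + [i / 1000 for i in range(1, 101)]})

kept = filter_top_percentile(toy, 2)  # drop the top 2% of days

print(f"rows before: {len(toy)}, rows after: {len(kept)}")
print(f"largest return kept: {kept['daily_return'].max()}")
```

The two largest returns are removed, and so is the NaN row, because a comparison against NaN is always False. In the notebook that NaN row is the first trading day, which has no daily return anyway.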

In [38]:
# Remove the top 0.112%, 0.224%, ..., 1.008% of days and recompute the compounded return
performances = []
for perc in [0.112 * i for i in range(1, 10)]:
    filtered_df = filter_top_percentile(data, perc)
    performance, days_invested = calculate_principal_with_data(filtered_df)
    performances.append(performance)

# Express each filtered return as a percentage of the unfiltered (optimal) baseline
multipliers_to_optimal = [(performance / baseline_opt_return) * 100 for performance in performances]

multipliers_to_optimal

[53.99803361036748,
 38.13291541598052,
 26.794073499513065,
 20.355043482889528,
 15.180462784663371,
 11.958987736793732,
 9.172150628831476,
 7.406330091005025,
 5.834895191654066]
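
Each entry above is the percentage of the baseline return that remains after the top days are removed, so the share attributable to the removed days is the complement. A small sketch of that arithmetic for the first entry, with the value copied from the output above:

```python
# First entry of multipliers_to_optimal: percent of the baseline that remains
# after removing the top 0.112% of days (value copied from the output above)
retained_pct = 53.99803361036748

# Share of the baseline lost when those days are missed
lost_pct = 100 - retained_pct

# Combined growth factor contributed by the removed days
removed_days_factor = 100 / retained_pct

print(f"retained: {retained_pct:.1f}%  lost: {lost_pct:.1f}%")
print(f"removed days multiply the principal by about {removed_days_factor:.2f}x")
```

Read this way, missing the top 0.112 percent of days costs a little under half of the optimal compounded return.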

In [25]:
performances

[2926876637.5974193,
 519798425.3310848,
 132185404.62592745,
 40376996.76518449,
 13936012.959107807,
 5235635.77994988,
 2157688.808562009,
 930460.7510461921,
 419302.8565791309,
 198253.1108880955,
 97426.34279741913,
 49976.97630353103,
 26616.82305252511,
 14696.848111218806]

In [26]:
baseline_opt_return

48511259834.9942