# <span style='color:red'>Quantitative Investing with Python</span>

### Professor Juhani Linnainmaa

Dartmouth College and Kepos Capital

*Last revised:* January 22, 2025

--- 

# **Topic 5:** Portfolio Optimization

One of the main goals in investments and trading is to maximize the Sharpe ratio.

Much of the discussion in investments is about this portfolio choice problem:

- We want to diversify across many assets
  - Diversification helps when assets are not perfectly correlated

- We would love to find assets that:

  1. Have high expected returns
  2. Have low standard deviation of returns
  3. Have low or negative correlation with other assets

- The same ideas apply to trading as well
  - Think of factors and signals as "assets"
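
To make the diversification point concrete, here is a minimal sketch of how the volatility of a 50/50 portfolio of two assets changes with their correlation. The volatilities and correlations are made-up numbers for illustration, and the function name `two_asset_vol` is hypothetical.

```python
import numpy as np

def two_asset_vol(w1, vol1, vol2, corr):
    """Volatility of a portfolio with weight w1 in asset 1 and (1 - w1) in asset 2."""
    w2 = 1.0 - w1
    variance = (w1 * vol1) ** 2 + (w2 * vol2) ** 2 + 2 * w1 * w2 * corr * vol1 * vol2
    return np.sqrt(variance)

# Two hypothetical assets, each with 20% annual volatility, held 50/50
for corr in [1.0, 0.5, 0.0, -0.5]:
    vol = two_asset_vol(0.5, 0.20, 0.20, corr)
    print(f'correlation = {corr:+.1f}  ->  portfolio volatility = {vol:.1%}')
```

With perfect correlation the portfolio is exactly as volatile as its components; any lower correlation reduces portfolio volatility.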
  
#### <span style='color:red'>These topics are discussed *a lot* in the Investments class!</span>

- Portfolio *optimization* is important in quantitative finance


#### Plan 

1. Visualizing the investment opportunity set
2. Finding the optimal (= maximum Sharpe ratio) portfolio
3. Measuring the performance of the equal-weighted and optimal portfolios in "training" and "validation" samples
4. Improving optimization
   - Shrinkage
   - Machine-learning approach to optimal shrinkage (**a more general point about the train-validate-test paradigm**)

### Import statements

One new package that we are using here is **scipy**

- We use it to find optimal portfolios

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime

import requests
from io import BytesIO, StringIO
import zipfile

from scipy.optimize import minimize

## Function for downloading Ken French data

In [None]:
def download_french_data(url=None, csvname=None, skiplines=None):
    
    response = requests.get(url)

    # If the request is NOT successful, raise an exception
    if response.status_code != 200:
        raise Exception(f"Failed to download zip file. Status code: {response.status_code}")

    with zipfile.ZipFile(BytesIO(response.content)) as zip_file:

        # Check if the file exists in the zip archive
        if csvname in zip_file.namelist():
            # Read the CSV file directly from the zip archive
            with zip_file.open(csvname) as csv_file:
                lines = csv_file.readlines()

            # Remove rows from the beginning
            lines = lines[skiplines:]

            # Create a DataFrame from the trimmed lines using StringIO
            # First need to decode byte strings into unicode
            lines = [line.decode("utf-8") for line in lines]

            # At some point the file switches from monthly data to annual data and other tables;
            # keep only the rows before the first blank line or 'Annual Factors' marker
            end = len(lines)  # if no marker is found, keep every line
            for idx, line in enumerate(lines):
                if ('Annual Factors' in line) or (len(line.strip()) == 0):
                    end = idx
                    break

            lines = lines[:end]
            clean_csv = '\n'.join(lines)
            df = pd.read_csv(StringIO(clean_csv))   
            
            # convert date into a format we understand and make it the index
            # also convert returns from percentages (e.g., 2.12) to decimals (e.g., 0.0212) by dividing by 100
            df['date'] = df['Unnamed: 0'].apply(lambda x: datetime.strptime(str(x), '%Y%m'))
            df = df.drop(columns='Unnamed: 0')
            df = df.set_index('date') / 100

            print(f'File {csvname} read successfully!')
            return df
        else:
            print(f'Zip file found but file {csvname} not found in the archive.')   
            return pd.DataFrame()

### Download data for 30 industry portfolios from Ken French's website

I also merge the risk-free rate from the Topic 3 data into these data

In [None]:
# Specify the file we want to read -- the CSV file inside has almost the same name 
url = 'https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/30_Industry_Portfolios_CSV.zip'
csvname = '30_Industry_Portfolios.CSV'

ind_data = download_french_data(url=url, csvname=csvname, skiplines=11)
industries = ind_data.columns.to_list()
n_industries = len(industries)

# open ff_data.pkl from Topic 3 and copy over RF column
ff_data = pd.read_pickle('/home/jovyan/data/ff_data.pkl')
ind_data = ind_data.merge(ff_data['RF'], left_index=True, right_index=True, how='left')
ind_data = ind_data.loc['1963-07':]
ind_data.to_pickle('/home/jovyan/data/ind_data.pkl')

print('\nData:\n')
print(ind_data.head(3))

### Split the data into three parts: "training data", "validation data", and "test data"

- I'll explain these terms in class
  - They are *very* central to machine/statistical learning
- I also record, separately, the risk-free rate for these three periods

In [None]:
train_data = ind_data.loc[:'2000-12', industries]
train_rf = ind_data.loc[:'2000-12', 'RF']

val_data = ind_data.loc['2001-01':'2015-12', industries]
val_rf = ind_data.loc['2001-01':'2015-12', 'RF']

test_data = ind_data.loc['2016-01':'2023-12', industries]
test_rf = ind_data.loc['2016-01':'2023-12', 'RF']

<br>
<div style="text-align: center; font-family: 'Georgia', sans-serif; font-size: 36px; font-weight: bold; color: red;">
    Visualizing the investment opportunity set
</div>

- The investment opportunity set is the set of all portfolios we *could* hold
  - If you have 100 assets, there are MANY (infinitely many) different portfolios you could construct<br><br>
  
  
- If we take the view that we care about expected returns and volatilities, we can characterize every portfolio with just two numbers:
  
  1. Expected return
  2. Volatility<br><br>
  
  
- Let's take five industries and create N random portfolios

In [None]:
N = 10_000
k = 5

simulation_results = []

for i in range(N):
    
    # draw random weights from the standard normal distribution
    weights = np.random.normal(loc=0.0, scale=1.0, size=k)
    weights /= np.sum(weights)
    
    # returns on the portfolio
    portfolio_return = train_data.iloc[:,:k].dot(weights)
    
    # compute annualized mean and standard deviation and append to the results-list
    simulation_results.append({'mean': 12 * portfolio_return.mean(), 'std': np.sqrt(12) * portfolio_return.std()})

# convert list into a DataFrame
simulation_results = pd.DataFrame(simulation_results)

# remove high-vol portfolios (weights explode when the random draws sum to nearly zero)
simulation_results = simulation_results[simulation_results['std']<1]
N = len(simulation_results)

In [None]:
simulation_results.iloc[:100].plot.scatter(x='std', y='mean', title='100 random portfolios', xlabel='Volatility', ylabel='Average Return', figsize=(12,8))

In [None]:
simulation_results.plot.scatter(x='std', y='mean', title=f'{N:,} random portfolios', xlabel='Volatility', ylabel='Average Return', figsize=(12,8));

## Analysis function from Topic 2

- Computes and reports Sharpe ratio
  - If $r_f$ is provided, subtract it from returns
- Plots cumulative returns
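
In symbols, with monthly returns $r_t$ and a constant monthly risk-free rate $r_f$, the annualized Sharpe ratio that the function reports can be written as follows. The notation $\bar{r}$ and $\hat{\sigma}(r)$ for the sample mean and sample standard deviation is introduced here for illustration.

$$
\text{SR} = \sqrt{12}\,\frac{\bar{r} - r_f}{\hat{\sigma}(r)}
$$

The factor $\sqrt{12}$ appears because, when monthly returns are assumed to be uncorrelated over time, the annual mean is 12 times the monthly mean while the annual standard deviation is $\sqrt{12}$ times the monthly one.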

In [None]:
def analyze_returns(r=None, rf=None, title=''):
    if rf is not None:
        sharpe_ratio = np.sqrt(12) * (r.mean() - rf) / r.std()
    else:
        sharpe_ratio = np.sqrt(12) * r.mean() / r.std()
    print(f'Strategy: {title}')
    print(f'Sharpe ratio: {sharpe_ratio:.2f}')
    r.cumsum().plot(figsize=(12,8), title=title)

<br>
<div style="text-align: center; font-family: 'Georgia', sans-serif; font-size: 36px; font-weight: bold; color: red;">
    An equal-weighted portfolio of 30 industries in training sample
</div>

In [None]:
weights_equal = np.array(len(industries) * [1. / len(industries)])
train_portfolio_return_equal = train_data.mul(weights_equal).sum(axis=1)
analyze_returns(r=train_portfolio_return_equal, rf=train_rf.mean(), title='1/N Strategy in Training Sample')

<br>
<div style="text-align: center; font-family: 'Georgia', sans-serif; font-size: 36px; font-weight: bold; color: red;">
    Mean-variance efficient portfolio in the training sample
</div>


- I first optimize numerically using returns themselves
- Optimization here works like Excel's Solver
- We need to specify a few things:
  - A function that returns the value we are *minimizing* (in Excel, we can just point to a cell that has the right formula; here, the function is the formula)
    - The function can also accept other arguments
  - The initial guess for the solution
  - The algorithm used to minimize
  - Any constraints (e.g., the weights add up to one)
  - Any bounds on the choice variables (e.g., do the weights need to be positive?)
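
The ingredients listed above can be seen in a toy problem that is smaller than the one in this topic: finding the minimum-variance mix of two assets. This is only a sketch under stated assumptions. The covariance matrix is made up (monthly returns measured in percent), and the names `toy_cov`, `portfolio_variance`, `toy_constraints`, and `toy_bounds` are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical covariance matrix of monthly returns measured in percent (made-up numbers)
toy_cov = np.array([[25.0, 5.0],
                    [5.0, 16.0]])

# 1. Objective: the function whose value we minimize (here, portfolio variance)
def portfolio_variance(weights, cov):
    return weights @ cov @ weights

# 2. Initial guess: an equal-weighted portfolio
initial_guess = np.array([0.5, 0.5])

# 3. Constraint: weights sum to one (an 'eq' constraint function must evaluate to zero)
toy_constraints = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1},)

# 4. Bounds: each weight between 0 and 1 (no short selling)
toy_bounds = [(0.0, 1.0), (0.0, 1.0)]

# 5. Algorithm: SLSQP handles both equality constraints and bounds
toy_result = minimize(portfolio_variance, initial_guess, args=(toy_cov,),
                      method='SLSQP', constraints=toy_constraints, bounds=toy_bounds)

# It is worth checking that the solver reports success before using the weights
print('Converged:', toy_result.success)
print('Weights:  ', toy_result.x.round(3))
```

For two assets the minimum-variance weight has a closed form, $w_1 = (\sigma_2^2 - \sigma_{12}) / (\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}) = 11/31 \approx 0.355$, so the solver's answer should be close to that.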

#### Note:

- This might look a bit overwhelming, but just think about it as setting up the solver
- The steps are always the same
- If you need to solve a completely different problem, you would pretty much just copy the code, modify a few things, and you'd be done

In [None]:
# This function computes and returns the NEGATIVE of the Sharpe ratio because the optimization function *minimizes*

def neg_sharpe_ratio(weights, df, rf):
    portfolio_return = df.mul(weights).sum(axis=1)
    sharpe_ratio = np.sqrt(12) * (portfolio_return.mean() - rf) / portfolio_return.std()
    return -sharpe_ratio

In [None]:
# Define a constraint: it is an EQuality constraint that sets the sum of weights to 1
constraints = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1})  

# Start from an equal-weighted portfolio 
results = minimize(neg_sharpe_ratio, weights_equal, args=(train_data, train_rf.mean()), method='SLSQP', constraints=constraints)

weights_optimal = results['x']

print('Optimal weights in percentages (%)')
pd.Series(100*weights_optimal, index=industries).sort_values().round(1)

In [None]:
train_portfolio_return_optimal = train_data.mul(weights_optimal).sum(axis=1)
analyze_returns(r=train_portfolio_return_optimal, rf=train_rf.mean(), title='Optimal Portfolio in the Training Sample')

## Where do we stand?

- We have looked at data from July 1963 to December 2000 (the training sample)
- An equal-weighted portfolio of the 30 industries has a Sharpe ratio of 0.42
- A mean-variance efficient portfolio has a Sharpe ratio of 0.90

# <span style='color:red'>Question</span>: Which portfolio would you prefer?

<br>
<div style="text-align: center; font-family: 'Georgia', sans-serif; font-size: 36px; font-weight: bold; color: red;">
    Mean-variance efficient portfolio in the VALIDATION sample
</div>

In [None]:
val_portfolio_return_equal = val_data.mul(weights_equal).sum(axis=1)
analyze_returns(r=val_portfolio_return_equal, rf=val_rf.mean(), title='1/N Strategy in the Validation Sample')

In [None]:
val_portfolio_return_optimal = val_data.mul(weights_optimal).sum(axis=1)
analyze_returns(r=val_portfolio_return_optimal, rf=val_rf.mean(), title='Optimal Portfolio in the Validation Sample')

<br>
<div style="text-align: center; font-family: 'Georgia', sans-serif; font-size: 36px; font-weight: bold; color: red;">
    Optimizing based on expected returns and covariances
</div>

- We don't need to have the full "historical data" to do optimization
- The optimal portfolio depends on three inputs:
  - Expected returns, variances, and covariances
- If we *have* historical data -- and we assume that these data are representative of future data -- we can do the optimization as before
- We get identical results if we instead compute average returns, variances, and covariances from the data and optimize using them
- There is some portfolio mathematics for this
  - The function "neg_sharpe_ratio2" is almost the same as before but it now accepts just the mean returns (30 numbers) and the covariance matrix (30x30 numbers) plus the risk-free rate
- There are great benefits to doing the optimization like this because now we can modify the inputs to get more sensible results
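
The portfolio mathematics is short: with weights $w$, mean vector $\mu$, and covariance matrix $\Sigma$, the portfolio's mean return is $w'\mu$ and its variance is $w'\Sigma w$. The sketch below checks on simulated data that this gives the same numbers as building the portfolio's return series first. The random returns and the names `fake_returns` and `w` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated monthly returns for four hypothetical assets (illustration only)
fake_returns = pd.DataFrame(rng.normal(0.01, 0.05, size=(240, 4)), columns=list('ABCD'))
w = np.array([0.4, 0.3, 0.2, 0.1])

# Route 1: build the portfolio's return series, then take its mean and variance
portfolio = fake_returns.mul(w).sum(axis=1)
mean_from_series = portfolio.mean()
var_from_series = portfolio.var()

# Route 2: use only the mean vector and the covariance matrix
mean_from_inputs = w @ fake_returns.mean().values
var_from_inputs = w @ fake_returns.cov().values @ w

print(np.isclose(mean_from_series, mean_from_inputs))
print(np.isclose(var_from_series, var_from_inputs))
```

Both routes use the same sample estimates, so the optimizer sees the same objective either way. The second route only needs the summary statistics, which is what makes it possible to adjust them before optimizing.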

In [None]:
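# Negative of the monthly (non-annualized) Sharpe ratio computed from means and covariances;
# leaving out the sqrt(12) scaling does not change the optimal weights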
def neg_sharpe_ratio2(weights, mean_returns, cov_matrix, rf):
    portfolio_return = np.dot(mean_returns, weights)
    portfolio_volatility = np.sqrt(np.dot(weights.T, np.dot(cov_matrix, weights)))
    sharpe_ratio = (portfolio_return - rf) / portfolio_volatility
    return -sharpe_ratio  # Negative for minimization

In [None]:
# Optimization constraints
constraints = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1})  # The sum of weights is 1

mean_returns = train_data.mean()
cov_matrix = train_data.cov()

# Start from an equal-weighted portfolio but then maximize Sharpe ratio
results = minimize(neg_sharpe_ratio2, weights_equal, args=(mean_returns, cov_matrix, train_rf.mean()), method='SLSQP', constraints=constraints)
weights_optimal = results['x']
train_portfolio_return_optimal = train_data.mul(weights_optimal).sum(axis=1)
analyze_returns(r=train_portfolio_return_optimal, rf=train_rf.mean(), title='Optimal Portfolio in the Training Sample (This Should be the Same as Before!)')

## A helper function for shrinking average returns and covariance matrix estimates to make them less noisy

- 'shrinkage' in statistics is about pulling estimates towards some prior
- E.g., if you think that some correlations are noise, you might want to pull correlations towards zero. 
  - In the code below, I multiply the off-diagonal elements by some number <1
- Similarly, I make the mean returns more similar to each other (I shrink them towards the overall mean)
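
Written out, with $\delta \in [0, 1]$ denoting the shrinkage fraction, the two adjustments are as follows. The symbol $\delta$, and the hats and tildes for raw and shrunk estimates, are notation introduced here for illustration.

$$
\tilde{\mu}_i = (1 - \delta)\,\hat{\mu}_i + \delta\,\bar{\mu}, \qquad \bar{\mu} = \frac{1}{N}\sum_{j=1}^{N}\hat{\mu}_j
$$

$$
\tilde{\Sigma}_{ij} =
\begin{cases}
\hat{\Sigma}_{ii} & \text{if } i = j \\
(1 - \delta)\,\hat{\Sigma}_{ij} & \text{if } i \neq j
\end{cases}
$$

Because the variances on the diagonal are left alone, scaling the covariances by $(1 - \delta)$ is the same as scaling every correlation by $(1 - \delta)$. Setting $\delta = 0$ leaves the estimates unchanged, while $\delta = 1$ makes all means equal and all correlations zero.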

In [None]:
def compute_cov_matrix(df: pd.DataFrame, shrink_pct: float = 0) -> pd.DataFrame:
    """
    Compute the covariance matrix of the given DataFrame and optionally shrink
    off-diagonal elements by 'shrink_pct'.
    """
    if df is None:
        raise ValueError("df cannot be None.")

    if not (0 <= shrink_pct <= 1):
        raise ValueError(f"Invalid shrink_pct={shrink_pct}. Must be between 0 and 1.")

    cov = df.cov()

    if shrink_pct > 0:
        # Create a mask for diagonal elements
        diagonal_mask = np.eye(len(cov), dtype=bool)
        # Scale only off-diagonal elements by (1 - shrink_pct)
        cov.values[~diagonal_mask] *= (1 - shrink_pct)

    return cov

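def compute_mean_returns(df=None, shrink_pct=0):
    """
    Compute mean returns and optionally shrink them toward the cross-sectional
    average by 'shrink_pct'.
    """
    if df is None:
        raise ValueError("df cannot be None.")
    if not (0 <= shrink_pct <= 1):
        raise ValueError(f'{shrink_pct=} invalid. Express shrink_pct as a number between 0 and 1')
    mean = df.mean()
    if shrink_pct > 0:
        overall_mean = mean.mean()
        mean = (1 - shrink_pct) * mean + shrink_pct * overall_mean
    return mean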

### Let's recompute the optimal portfolio after shrinking both the means and covariances by 20%

In [None]:
mean_returns = compute_mean_returns(train_data, 0.2)
cov_matrix = compute_cov_matrix(train_data, 0.2)

# Start from an equal-weighted portfolio but then maximize Sharpe ratio
results = minimize(neg_sharpe_ratio2, weights_equal, args=(mean_returns, cov_matrix, train_rf.mean()), method='SLSQP', constraints=constraints)
weights_optimal = results['x']
train_portfolio_return_optimal = train_data.mul(weights_optimal).sum(axis=1)
analyze_returns(r=train_portfolio_return_optimal, rf=train_rf.mean(), title='Refined Optimal Portfolio in the Training Sample')

In [None]:
val_portfolio_return_optimal = val_data.mul(weights_optimal).sum(axis=1)
analyze_returns(r=val_portfolio_return_optimal, rf=val_rf.mean(), title='Refined Optimal Portfolio in the Validation Sample')

In [None]:
print('Optimal weights in percentages (%)')
pd.Series(100*weights_optimal, index=industries).sort_values().round(1)

<br>
<div style="text-align: center; font-family: 'Georgia', sans-serif; font-size: 36px; font-weight: bold; color: red;">
    Machine Learning Approach to Finding the Optimal Shrinkage
</div>

- The key idea in machine learning is the **Train-Validate-Test** paradigm
- I implement it using our training data/validation data split
  - The **idea** is the key, not the specific implementation
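
A stripped-down sketch of the paradigm on simulated data may help separate the idea from the implementation. Everything here is an assumption made for illustration: the returns are random draws, the function names `fit_weights` and `sharpe` are hypothetical, and closed-form weights proportional to $\Sigma^{-1}\mu$ stand in for the numerical optimizer to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated monthly excess returns for 5 hypothetical assets (illustration only)
n_assets = 5
true_means = np.full(n_assets, 0.005)
returns = rng.normal(true_means, 0.05, size=(600, n_assets))

# Chronological split: first 300 months train, next 150 validate, last 150 test
train, validate, test = returns[:300], returns[300:450], returns[450:]

def fit_weights(sample, shrink):
    """Weights proportional to inv(cov) @ means, with means shrunk toward their average."""
    means = sample.mean(axis=0)
    means = (1 - shrink) * means + shrink * means.mean()
    raw = np.linalg.solve(np.cov(sample, rowvar=False), means)
    # Rescale so absolute weights sum to one; a positive rescaling leaves the Sharpe ratio unchanged
    return raw / np.abs(raw).sum()

def sharpe(sample, weights):
    """Annualized Sharpe ratio of a fixed-weight portfolio (returns are already excess returns)."""
    r = sample @ weights
    return np.sqrt(12) * r.mean() / r.std(ddof=1)

# Train and validate: fit on the training sample for each candidate, score on validation
grid = np.linspace(0.0, 1.0, 11)
val_scores = [sharpe(validate, fit_weights(train, s)) for s in grid]
best = grid[int(np.argmax(val_scores))]

# Test: evaluate the single chosen setting once, on data not used for fitting or selection
test_score = sharpe(test, fit_weights(train, best))
print(f'chosen shrinkage = {best:.1f}, test Sharpe = {test_score:.2f}')
```

The validation sample is used only to choose the tuning parameter, and the test sample is touched once at the end, so the test Sharpe ratio is not flattered by the search itself.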

In [None]:
from tqdm import tqdm 

# linspace includes both endpoints exactly, avoiding floating-point surprises at the upper bound
shrinkage_grid = np.linspace(0.0, 1.0, 11)

search_results = []

for mean_shrinkage in tqdm(shrinkage_grid):
    for cov_shrinkage in shrinkage_grid:
        
        mean_returns = compute_mean_returns(train_data, mean_shrinkage)
        cov_matrix = compute_cov_matrix(train_data, cov_shrinkage)

        results = minimize(neg_sharpe_ratio2, weights_equal, args=(mean_returns, cov_matrix, train_rf.mean()), method='SLSQP', constraints=constraints)
        weights_optimal = results['x']

        val_portfolio_return = val_data.mul(weights_optimal).sum(axis=1)
        sharpe = np.sqrt(12) * (val_portfolio_return.mean() - val_rf.mean()) / val_portfolio_return.std()
        
        search_results.append({'mean_shrinkage': mean_shrinkage, 'cov_shrinkage': cov_shrinkage, 'val_sharpe': sharpe})
        
search_results = pd.DataFrame(search_results).sort_values('val_sharpe').set_index(['mean_shrinkage', 'cov_shrinkage'])
search_results        

In [None]:
import matplotlib.pyplot as plt 

best_shrinkages = search_results.iloc[-1].name

mean_returns = compute_mean_returns(train_data, best_shrinkages[0])
cov_matrix = compute_cov_matrix(train_data, best_shrinkages[1])

# Start from an equal-weighted portfolio but then maximize Sharpe ratio
results = minimize(neg_sharpe_ratio2, weights_equal, args=(mean_returns, cov_matrix, train_rf.mean()), method='SLSQP', constraints=constraints)
weights_optimal = results['x']

train_portfolio_return = train_data.mul(weights_optimal).sum(axis=1)
validation_portfolio_return = val_data.mul(weights_optimal).sum(axis=1)
test_portfolio_return = test_data.mul(weights_optimal).sum(axis=1)

analyze_returns(r=train_portfolio_return, rf=train_rf.mean(), title='Optimized Portfolio in the Training Sample')
plt.show()

analyze_returns(r=validation_portfolio_return, rf=val_rf.mean(), title='Optimized Portfolio in the Validation Sample')
plt.show()

analyze_returns(r=test_portfolio_return, rf=test_rf.mean(), title='Optimized Portfolio in the Test Sample')
plt.show()