# Stock Market Investing for the Layperson

<img src="stonks.webp" alt="Sectors" style="padding:20px">

# Objective

Our goal is to create an interactive investing tool that will help the average person create a customized stock portfolio that is set up for success. We hope to make stock market investing more accessible to the average person, helping them choose where and how to directly invest their hard-earned money without the need to rely on third-party wealth management firms.

### Modern Portfolio Theory: An introduction

<img src="markowitz.jpg" alt="Harry Markowitz" style="float: right; padding:20px; max-width: 400px; max-height: 500px; margin-left: 10px;">

**Modern Portfolio Theory** (MPT) is a concept in finance that describes ways of diversifying and allocating assets in a financial portfolio in order to maximize the portfolio's expected return given the owner's risk tolerance. American economist Harry Markowitz first introduced MPT in a 1952 paper. The theory aims to reduce idiosyncratic risk, which is the risk inherent in a particular investment due to its unique characteristics.

A key component of this framework is **diversification**. When using MPT, an investor bundles different types of investments together so that when some of the securities fall in value, others are likely to rise or to fall less, offsetting part of the loss. The overall portfolio is therefore steadier than its individual holdings, yet when markets rise overall, the portfolio rises along with the market's rising tide.

MPT argues that any given investment's risk and return characteristics should not be viewed alone but evaluated by how it affects the overall portfolio's risk and return. That is, an investor can construct a portfolio of multiple assets that will result in greater returns without a higher level of risk. As an alternative, starting with a desired level of expected return, the investor can construct a portfolio with the lowest possible risk that is capable of producing that return.

MPT uses precise financial mathematics to carefully construct the portfolio. 
The steps involved include:

- Valuing the securities that might be included in the portfolio.
- Calculating the desired asset allocation, that is, the mix of assets.
- Performing calculations to optimize the portfolio to get the maximum amount of return for the minimum amount of risk.
- Using financial analysis to monitor the portfolio to see if it meets expectations and then making changes to the individual securities or asset mix when market conditions warrant a change.




An important consideration in MPT is that, based on statistical measures such as **variance** and **correlation**, a single investment's performance is less important than how it affects the entire portfolio.

MPT also assumes that investors are **risk-averse**, meaning that they prefer a less risky portfolio to a riskier one for a given level of return. As a practical matter, risk aversion implies that most people should invest in multiple asset classes (stocks, bonds, commodities, cash equivalents or cryptocurrencies for example).

#### Portfolio return

The expected return of the portfolio is calculated as a weighted sum of the returns of the individual assets:

$$E(R_p) = \sum_{i=1}^{n} w_i E(R_i) $$

Where:
- $ E(R_p) $ is the expected return of the portfolio.
- $ w_i $ is the weight of asset $ i $ in the portfolio.
- $ E(R_i) $ is the expected return of asset $ i $.
- $ n $ is the number of assets in the portfolio.

#### Portfolio return example

Let's imagine a portfolio contains four assets with the following characteristics:

- Asset 1: Weight = 20%. Expected Return = 4%. Volatility (σ1) = 5%
- Asset 2: Weight = 30%. Expected Return = 6%. Volatility (σ2) = 7%
- Asset 3: Weight = 25%. Expected Return = 10%. Volatility (σ3) = 10%
- Asset 4: Weight = 25%. Expected Return = 14%. Volatility (σ4) = 15%

The expected return of the portfolio is calculated as follows:

$$
E(R_p) = (0.04 \times 0.20) + (0.06 \times 0.30) + (0.10 \times 0.25) + (0.14 \times 0.25)
$$

Simplifying the calculation:

$$
E(R_p) = 0.008 + 0.018 + 0.025 + 0.035 = 0.086 \quad \text{or} \quad 8.6\%
$$

Therefore, the expected return of the portfolio is:

$$
E(R_p) = 0.086 \quad \text{or} \quad 8.6\%
$$
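
The same calculation can be checked in a few lines of Python. This is a minimal sketch using only the standard library; the names `weights`, `expected_returns`, and `portfolio_expected_return` are illustrative and are not part of the project code.

```python
# Weights and expected returns of the four example assets (as decimals).
weights = [0.20, 0.30, 0.25, 0.25]
expected_returns = [0.04, 0.06, 0.10, 0.14]


def portfolio_expected_return(weights, expected_returns):
    """Weighted sum of asset returns: E(R_p) = sum_i w_i * E(R_i)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("Portfolio weights must sum to 1.")
    return sum(w * r for w, r in zip(weights, expected_returns))


expected_return = portfolio_expected_return(weights, expected_returns)
print(f"Expected portfolio return: {expected_return:.2%}")  # prints 8.60%
```

The weight check is a design choice made for this sketch: it catches a common input mistake (weights entered as percentages, or an asset left out) before it silently skews the result.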

#### Portfolio risk
The portfolio's risk is a function of the variances of each asset and the correlations of each pair of assets. Following the previous example, to calculate the risk of the portfolio, an investor needs each of the four assets' variances and six correlation values, since there are six possible two-asset combinations with four assets. The variance of the portfolio is given by:

$$ \sigma_p^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} w_i w_j \sigma_{ij} $$

- $ \sigma_p^2 $ is the variance of the portfolio.
- $ w_i $ and $ w_j $ are the weights of assets $ i $ and $ j $ in the portfolio.
- $ \sigma_{ij} $ is the covariance between the returns of assets $ i $ and $ j $.




The standard deviation (risk) of the portfolio is the square root of the variance:

$$ \sigma_p = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} w_i w_j \sigma_{ij}} $$

Standard deviation is used instead of variance because it is in the same units as the assets' returns and can be used to calculate risk-adjusted measures. In addition, returns in finance are often assumed to be normally distributed. In a normal distribution, about 68% of the data falls within one standard deviation of the mean and about 95% falls within two standard deviations. This property helps in estimating the probability of different return outcomes.

Also, MPT and other portfolio optimization techniques such as the efficient frontier are commonly formulated in terms of standard deviation because this optimization seeks to balance returns against risk in a way that is more meaningful when risk is measured in the same units as the return.

In addition, because the assets are not perfectly correlated, the total portfolio risk is lower than the weighted sum of the individual assets' standard deviations.
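
To see why, consider the two-asset case with nonnegative weights. This is a worked illustration; $ \rho_{12} $ is notation introduced here for the correlation between the two assets, so that $ \sigma_{12} = \rho_{12} \sigma_1 \sigma_2 $:

$$
\sigma_p^2 = w_1^2 \sigma_1^2 + w_2^2 \sigma_2^2 + 2 w_1 w_2 \rho_{12} \sigma_1 \sigma_2 \le (w_1 \sigma_1 + w_2 \sigma_2)^2
$$

Equality holds only when $ \rho_{12} = 1 $. For any lower correlation, $ \sigma_p $ is strictly smaller than the weighted sum $ w_1 \sigma_1 + w_2 \sigma_2 $, and the gap widens as the correlation falls.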



#### Portfolio risk example

Following the previous example, the variance calculation involves:

Variances of individual assets (on the diagonal):
\begin{align*}
    & \sigma_{11} = \sigma_1^2 = 0.05^2 = 0.0025 \\
    & \sigma_{22} = \sigma_2^2 = 0.07^2 = 0.0049 \\
    & \sigma_{33} = \sigma_3^2 = 0.10^2 = 0.01 \\
    & \sigma_{44} = \sigma_4^2 = 0.15^2 = 0.0225 \\
\end{align*}

Covariances of pairs of assets (off-diagonal elements):
\begin{align*}
    &\sigma_{12} = \sigma_{21} = 0.0006 \\
    &\sigma_{13} = \sigma_{31} = 0.0008 \\
    &\sigma_{14} = \sigma_{41} = 0.0010 \\
    &\sigma_{23} = \sigma_{32} = 0.0012 \\
    &\sigma_{24} = \sigma_{42} = 0.0014 \\
    &\sigma_{34} = \sigma_{43} = 0.0020 \\
\end{align*}

Substituting these values into the variance formula:

$$
\begin{aligned}
\sigma_p^2 &= (0.20^2 \cdot 0.0025) + (0.30^2 \cdot 0.0049) + (0.25^2 \cdot 0.01) + (0.25^2 \cdot 0.0225) \\
           &\quad + 2 \cdot 0.20 \cdot 0.30 \cdot 0.0006 + 2 \cdot 0.20 \cdot 0.25 \cdot 0.0008 + 2 \cdot 0.20 \cdot 0.25 \cdot 0.0010 \\
           &\quad + 2 \cdot 0.30 \cdot 0.25 \cdot 0.0012 + 2 \cdot 0.30 \cdot 0.25 \cdot 0.0014 + 2 \cdot 0.25 \cdot 0.25 \cdot 0.0020 \\
           &= 0.0001 + 0.000441 + 0.000625 + 0.00140625 \\
           &\quad + 0.000072 + 0.00008 + 0.0001 + 0.00018 + 0.00021 + 0.00025 \\
           &= 0.00346425
\end{aligned}
$$

The portfolio standard deviation $ \sigma_p $ is the square root of the variance:

$$
\sigma_p = \sqrt{0.00346425} \approx 0.0589 \quad \text{or} \quad 5.89\%
$$
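
Because the double sum has sixteen terms, hand arithmetic is error-prone, so it can help to evaluate the formula numerically. The following is a minimal sketch using only the standard library; it takes the weights, volatilities, and covariances from the example, while the variable names and the zero-based index keys are assumptions made for illustration.

```python
import math

weights = [0.20, 0.30, 0.25, 0.25]
volatilities = [0.05, 0.07, 0.10, 0.15]

# Pairwise covariances from the example, keyed by zero-based asset indices.
pair_covariances = {
    (0, 1): 0.0006, (0, 2): 0.0008, (0, 3): 0.0010,
    (1, 2): 0.0012, (1, 3): 0.0014, (2, 3): 0.0020,
}

n = len(weights)

# Build the full symmetric covariance matrix, with variances on the diagonal.
cov = [[0.0] * n for _ in range(n)]
for i in range(n):
    cov[i][i] = volatilities[i] ** 2
for (i, j), value in pair_covariances.items():
    cov[i][j] = cov[j][i] = value

# Double sum over all (i, j) pairs: sigma_p^2 = sum_i sum_j w_i w_j sigma_ij.
portfolio_variance = sum(
    weights[i] * weights[j] * cov[i][j] for i in range(n) for j in range(n)
)
portfolio_volatility = math.sqrt(portfolio_variance)

print(f"Portfolio variance:   {portfolio_variance:.8f}")
print(f"Portfolio volatility: {portfolio_volatility:.2%}")
```

Summing over every ordered pair counts each covariance twice (once as $ \sigma_{ij} $ and once as $ \sigma_{ji} $), which is where the factor of 2 in the expanded formula comes from.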

#### Portfolio and risk-adjusted measures

According to MPT, a portfolio frontier, also known as an efficient frontier, is the set of portfolios that maximize expected return for each level of standard deviation (risk). As we've seen, the expected return is the weighted sum of the individual assets' returns, whereas the standard deviation measures the level of risk associated with an asset or portfolio (also called volatility in finance).

Related to risk is the risk-free rate, which is the return an investor expects to earn on an asset with zero risk. Although every asset carries some level of risk, assets with a low probability of default and fixed returns (Treasury bills, for example) are treated as risk-free.

<img src="mpt_1.webp" alt="Efficient Frontier" style="float: left; padding:10px; max-width: 500px; max-height: 500px; margin-left: 10px;">

- Point A is the **minimum variance portfolio**
- Point B is the **optimal market portfolio**
- The dashed line is the **Capital Allocation Line** or **CAL**.



In this image, the upper portion of the curve is the "Efficient Frontier", that is, the set of combinations of risky assets that maximize expected return for a given level of standard deviation. Therefore, any portfolio on this portion of the curve offers the best possible expected return for a given level of risk.

- Point A on the efficient frontier is the **minimum variance portfolio**, the combination of risky assets that minimizes risk.
- Point B is the **optimal market portfolio**, the combination of risky assets at which the line drawn from the risk-free rate touches the efficient frontier.
- The line tangent to the efficient frontier is called the **Capital Allocation Line** or **CAL**.


The CAL is a line that depicts the risk-reward tradeoff of assets that carry risk. The slope of this line is called the Sharpe ratio, and can be defined as the increase in expected return per additional unit of standard deviation. In the above image, the reward-to-risk ratio (the slope of the CAL) is highest at point B, which is the combination that creates the optimal portfolio according to MPT.


In summary, according to MPT, rational risk-averse investors should hold portfolios that fall on the efficient frontier (since they all provide the highest possible expected return for a given level of risk). The optimal portfolio (also called the market portfolio) is the combination of risky assets at point B; an investor can then combine it with the risk-free asset, moving along the CAL, to match their own risk tolerance.
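
The Capital Allocation Line can also be written as an equation. As an illustration, let $ R_f $ be the risk-free rate, $ E(R_B) $ and $ \sigma_B $ the expected return and standard deviation of the portfolio at point B, and $ \sigma_c $ the standard deviation of a combined portfolio that mixes the risk-free asset with portfolio B (the subscripts $ B $ and $ c $ are notation introduced here):

$$ E(R_c) = R_f + \frac{E(R_B) - R_f}{\sigma_B} \, \sigma_c $$

The fraction is the slope of the CAL. Every extra unit of standard deviation the investor accepts adds that much expected return, starting from $ R_f $ when the combined portfolio holds only the risk-free asset.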

#### Risk-adjusted measures example

As we've seen, the Sharpe Ratio is a measure of the risk-adjusted return of the portfolio, and can be calculated as follows:

$$ \text{Sharpe Ratio} = \frac{E(R_p) - R_f}{\sigma_p} $$

Where:
- $ E(R_p) $ is the expected return of the portfolio.
- $ R_f $ is the risk-free rate.
- $ \sigma_p $ is the standard deviation of the portfolio.

Following the example portfolio, and assuming a risk-free rate of 3%, we substitute $ E(R_p) = 0.086 $ and $ \sigma_p \approx 0.0589 $ (the square root of the portfolio variance, $ 0.00346425 $) into the formula:

$$
\begin{aligned}
\text{Sharpe Ratio} &= \frac{0.086 - 0.03}{0.0589} \\
                    &= \frac{0.056}{0.0589} \\
                    &\approx 0.951
\end{aligned}
$$

Therefore, the Sharpe ratio of the portfolio is approximately $ 0.951 $. A common rule of thumb in finance is to consider a Sharpe ratio greater than 1 acceptable, so this example portfolio falls just short of that threshold. Portfolios with a lower ratio offer less compensation for the level of risk taken. The greater the ratio, the more return the portfolio yields on a risk-adjusted basis.
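
The formula is small enough to wrap in a helper. The sketch below is illustrative only: `sharpe_ratio` is a hypothetical function name, and the two portfolios are made-up inputs chosen to show how volatility alone changes the ratio when the expected return is held fixed.

```python
def sharpe_ratio(expected_return, risk_free_rate, volatility):
    """Excess return earned per unit of risk (standard deviation)."""
    if volatility <= 0:
        raise ValueError("Volatility must be positive.")
    return (expected_return - risk_free_rate) / volatility


# Hypothetical inputs: two portfolios with the same 8% expected return
# but different volatility, and a 3% risk-free rate.
risk_free_rate = 0.03
calm = sharpe_ratio(0.08, risk_free_rate, 0.04)
choppy = sharpe_ratio(0.08, risk_free_rate, 0.10)

print(f"Lower-volatility portfolio:  {calm:.2f}")
print(f"Higher-volatility portfolio: {choppy:.2f}")
```

All three inputs must cover the same period (annual figures here), otherwise the ratio is not comparable across portfolios.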

## Data Source

We chose the S&P 500 as our market. Historical data was pulled from the Yahoo Finance API. This was merged with more detailed information about individual companies, pulled from Wikipedia.
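
The merge step can be sketched with pandas as follows. The two small DataFrames are hypothetical stand-ins for the downloaded price history and the company table (their values are made up), and joining on `Symbol` is an assumption based on the columns of the combined file shown below.

```python
import pandas as pd

# Hypothetical price rows in long format (one row per date and symbol).
prices = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-03"]),
        "Symbol": ["MMM", "CVX", "MMM"],
        "Adj Close": [90.0, 150.0, 91.0],
    }
)

# Hypothetical company metadata (one row per symbol).
companies = pd.DataFrame(
    {
        "Symbol": ["MMM", "CVX"],
        "Security": ["3M", "Chevron Corporation"],
        "GICS Sector": ["Industrials", "Energy"],
    }
)

# A left join keeps every price row and attaches the matching company details.
# validate="many_to_one" raises an error if a symbol appears twice in `companies`.
combined = prices.merge(companies, on="Symbol", how="left", validate="many_to_one")
print(combined)
```

A left join is used so that no price row is dropped if a symbol is missing from the metadata; such rows would simply carry missing values in the company columns.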

In [1]:
import pandas as pd

In [2]:
sp500_combined = pd.read_csv("sp500_combined.csv")

In [3]:
sp500_combined.head(3)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Symbol,Security,GICS Sector,GICS Sub-Industry
0,2014-04-01,113.612038,114.255852,113.436455,114.155518,76.270012,2835477.0,MMM,3M,Industrials,Industrial Conglomerates
1,2014-04-02,113.70401,113.921402,113.152176,113.712372,75.97393,3924554.0,MMM,3M,Industrials,Industrial Conglomerates
2,2014-04-03,113.896324,114.707359,113.46154,113.82943,76.052124,3200735.0,MMM,3M,Industrials,Industrial Conglomerates


In [4]:
# Convert 'Date' column to datetime
sp500_combined['Date'] = pd.to_datetime(sp500_combined['Date'])

In [5]:
# Set 'Date' as the index and sort the index
sp500_combined.set_index('Date', inplace=True)
sp500_combined.sort_index(inplace=True)

# Subset the DataFrame for the desired date range, then reset the index
# to make 'Date' a column again (chained, so no in-place change is made on a slice)
sp500_5y = sp500_combined.loc['2019-04-01':'2024-03-31'].reset_index()

In [6]:
sp500_5y.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Symbol,Security,GICS Sector,GICS Sub-Industry
0,2019-04-01,28.049999,28.049999,27.360001,27.889999,26.317907,3255500.0,PHM,PulteGroup,Consumer Discretionary,Homebuilding
1,2019-04-01,160.0,161.770004,159.759995,161.470001,145.721207,2393600.0,HON,Honeywell,Industrials,Industrial Conglomerates
2,2019-04-01,80.980003,81.110001,79.970001,80.779999,63.534737,4759400.0,ABBV,AbbVie,Health Care,Biotechnology
3,2019-04-01,8.86,9.0,8.86,8.98,7.067306,45653100.0,F,Ford Motor Company,Consumer Discretionary,Automobile Manufacturers
4,2019-04-01,122.129997,122.360001,119.959999,120.480003,104.641907,950700.0,DRI,Darden Restaurants,Consumer Discretionary,Restaurants


## How to build a Portfolio





<img src="Sectors.png" alt="Sectors" style="float: left; padding:10px; max-width: 500px; max-height: 500px; margin-right: 20px;">

Stocks within each sector can be expected to respond similarly to various economic factors.

#### P&G vs Chevron

<img src="PG-CVX.png" alt="Sectors" style="float: left; padding:10px; max-width: 800px; max-height: 600px; margin-right: 20px;">

### Correlations help us find the assets that best complement one another. This is the key to a diversified portfolio that is flexible enough to compensate for lows in one asset with highs in another.

<img src="heatmap.png" alt="Sectors" style="padding:20px">

In [7]:
# Import necessary libraries and modules
import datetime as dt
import json

import holoviews as hv
import hvplot.pandas
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import requests
import requests_cache
import scipy.optimize as sc
import seaborn as sns
import yfinance as yf
from bokeh.io import output_notebook
from bokeh.models import HoverTool
from IPython.display import Image
from pandas_datareader import data as wb
from pycirclize import Circos
from pyrate_limiter import Duration, RequestRate, Limiter
from requests import Session
from requests_cache import CacheMixin, SQLiteCache
from requests_ratelimiter import LimiterMixin, MemoryQueueBucket
from scipy.cluster.hierarchy import fcluster
from scipy.optimize import minimize

%matplotlib inline


# Session that combines response caching with rate limiting
class CachedLimiterSession(CacheMixin, LimiterMixin, Session):
    pass


In [8]:
# Read the saved dataset
df_part1 = pd.read_csv("sp500_combined.csv")

In [9]:
# Convert 'Date' column to datetime
df_part1['Date'] = pd.to_datetime(df_part1['Date'])

# Set 'Date' column as index
df_part1.set_index('Date', inplace=True)

In [10]:
# Select the columns for the correlations

selected_columns = ['GICS Sector', 'Symbol', 'Adj Close']
df_part1 = df_part1[selected_columns]
df_part1.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1226068 entries, 2014-04-01 to 2024-03-28
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   GICS Sector  1226068 non-null  object 
 1   Symbol       1226068 non-null  object 
 2   Adj Close    1226068 non-null  float64
dtypes: float64(1), object(2)
memory usage: 37.4+ MB


In [11]:
# Create a column for the calculated log value of the Adjusted Closing Price

df_part1['Log_Adj_Close'] = np.log(df_part1['Adj Close'])
df_part1.head()

Date,GICS Sector,Symbol,Adj Close,Log_Adj_Close
2014-04-01,Industrials,MMM,76.270012,4.33428
2014-04-02,Industrials,MMM,75.97393,4.33039
2014-04-03,Industrials,MMM,76.052124,4.331419
2014-04-04,Industrials,MMM,75.895737,4.329361
2014-04-07,Industrials,MMM,75.080139,4.318556


In [12]:
# Pivot the dataframe so that 'Date' is the index (time series), each
# 'GICS Sector' is a column, and each value is the daily mean of
# Log_Adj_Close across the stocks in that sector

pivot_df = df_part1.pivot_table(index='Date', columns='GICS Sector', values='Log_Adj_Close', aggfunc='mean')
pivot_df.head()

Date,Communication Services,Consumer Discretionary,Consumer Staples,Energy,Financials,Health Care,Industrials,Information Technology,Materials,Real Estate,Utilities
2014-04-01,3.423455,3.996665,3.635249,3.838805,3.690468,4.009264,3.870418,3.501336,3.82689,3.726187,3.305446
2014-04-02,3.423211,4.001898,3.635293,3.843113,3.692476,4.01455,3.875537,3.503208,3.833971,3.725508,3.305085
2014-04-03,3.414846,3.99629,3.638149,3.847506,3.69113,4.012202,3.874369,3.492563,3.833614,3.722461,3.308434
2014-04-04,3.395996,3.977981,3.630151,3.841237,3.674558,3.991284,3.858754,3.469738,3.825446,3.721972,3.31379
2014-04-07,3.380499,3.957151,3.629783,3.821653,3.654642,3.977089,3.841824,3.45656,3.809128,3.721574,3.309408


In [13]:
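# Calculate log returns. pivot_df already holds (mean) log prices, so the
# log return is the day-over-day difference, not the log of a price ratio
log_returns = pivot_df.diff()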

# Drop the first row as it contains NaN values due to the shift
log_returns = log_returns.dropna()


In [14]:
correlation_matrix = log_returns.corr()
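

Once a correlation matrix is available, the least correlated pairs can be listed directly instead of being read off a chart. The following is a minimal sketch: `least_correlated_pairs` is a hypothetical helper, and the three-sector matrix holds made-up values for illustration rather than results from the data.

```python
import numpy as np
import pandas as pd


def least_correlated_pairs(corr, top_n=3):
    """Return the `top_n` distinct pairs with the lowest correlation."""
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair
    # appears once and self-correlations are excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values().head(top_n)


# Hypothetical correlation values for three sectors, for illustration only.
labels = ["Energy", "Utilities", "Information Technology"]
example_corr = pd.DataFrame(
    [[1.0, 0.3, 0.5],
     [0.3, 1.0, 0.2],
     [0.5, 0.2, 1.0]],
    index=labels,
    columns=labels,
)

lowest = least_correlated_pairs(example_corr, top_n=2)
print(lowest)
```

Under the same assumptions, calling `least_correlated_pairs(correlation_matrix)` on the matrix computed above would rank the sector pairs that diversify each other best.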


In [15]:
# Initialize Holoviews
hv.extension('bokeh')
output_notebook()

sectors = ['Information Technology', 'Health Care', 'Financials', 'Consumer Discretionary', 'Industrials', 'Communication Services', 'Consumer Staples', 'Energy', 'Materials', 'Real Estate', 'Utilities']
# Use the correlations computed from the sector log returns (not random
# placeholder values), ordered as in `sectors`
correlation_matrix_df = correlation_matrix.loc[sectors, sectors]

# Ensure that the index and columns of the DataFrame match
if not set(correlation_matrix_df.index) == set(correlation_matrix_df.columns):
    raise ValueError("Index and columns of the correlation matrix must match.")

# Prepare the DataFrame suitable for a chord diagram
data = []

for i, sector1 in enumerate(correlation_matrix_df.index):
    for j, sector2 in enumerate(correlation_matrix_df.columns):
        if i != j:
            try:
                value = correlation_matrix_df.at[sector1, sector2]
                data.append([sector1, sector2, value])
            except KeyError as e:
                print(f"KeyError: {e}")

df = pd.DataFrame(data, columns=['source', 'target', 'value'])

# Check if the DataFrame was created successfully
if df.empty:
    raise ValueError("The DataFrame for the chord diagram is empty. Please check the input correlation matrix.")

# Find the highest and lowest correlations for each sector
highest_correlation = correlation_matrix_df.replace(1, np.nan).max(axis=1)
lowest_correlation = correlation_matrix_df.replace(1, np.nan).min(axis=1)

# Create a dictionary for tooltip information
tooltip_info = {sector: (f'Highest: {highest_correlation[sector]:.2f}', f'Lowest: {lowest_correlation[sector]:.2f}')
                for sector in correlation_matrix_df.index}

# Add tooltip information to the DataFrame
df['source_highest'] = df['source'].map(lambda x: tooltip_info[x][0])
df['source_lowest'] = df['source'].map(lambda x: tooltip_info[x][1])
df['target_highest'] = df['target'].map(lambda x: tooltip_info[x][0])
df['target_lowest'] = df['target'].map(lambda x: tooltip_info[x][1])

# Create a chord diagram using Holoviews
chord = hv.Chord(df)
chord.opts(
    width=800, height=800,
    labels='index',
    node_color='index',
    cmap='Category20',
    edge_color='value',
    edge_cmap='Category20',
    edge_line_width=2,
    tools=['hover'],
    inspection_policy='edges',
    edge_hover_line_color='black',
    node_hover_line_color='black',
)

# Customize hover tool to show highest and lowest correlations
hover = HoverTool(tooltips=[
    ('Source', '@source'),
    ('Target', '@target'),
    ('Value', '@value{0.00}'),
    ('Source Highest', '@source_highest'),
    ('Source Lowest', '@source_lowest'),
    ('Target Highest', '@target_highest'),
    ('Target Lowest', '@target_lowest')
])

# Show the chord diagram with the hover tool
plot = hv.render(chord)
plot.add_tools(hover)

In [16]:
hv.output(chord)


<img src="portfolio-choices.png" alt="Sectors" style="float: left; padding:10px; max-width: 800px; max-height: 800px; margin-right: 20px;">

<img src="heatmap-selections.png" alt="Sectors" style="float: left; padding:10px; max-width: 800px; max-height: 800px; margin-right: 20px;">

<img src="starters.png" alt="Sectors" style="float: left; padding:10px; max-width: 800px; max-height: 800px; margin-right: 20px;">

In [17]:
import matplotlib.pyplot as plt
import ipywidgets as widgets
from pandas_datareader import data as wb
import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [18]:
sp500_combined = pd.read_csv("sp500_combined.csv")

In [19]:
# Define function to update plot based on selected assets
def update_plot(asset1, asset2, asset3):
    assets = [asset1, asset2, asset3]
    # Filter data for selected assets
    filtered_data = sp500_combined[sp500_combined['Symbol'].isin(assets)]
    # Pivot the dataframe to have dates as index and symbols as columns
    pf_data = filtered_data.pivot(index='Date', columns='Symbol', values='Adj Close')
    # Normalize the data
    normalized_data = (pf_data / pf_data.iloc[0] * 100)
    # Plot the normalized data
    normalized_data.plot(figsize=(10, 5))
    plt.title('Normalized Adjusted Prices')
    plt.ylabel('Normalized Price (Base = 100)')
    plt.xlabel('Date')
    plt.legend(title='Symbol')
    plt.show()

# Create dropdown widgets for asset selection
asset_dropdown1 = widgets.Dropdown(
    options=sorted(sp500_combined['Symbol'].unique()),
    value='MSFT',
    description='Asset 1'
)

asset_dropdown2 = widgets.Dropdown(
    options=sorted(sp500_combined['Symbol'].unique()),
    value='CVX',
    description='Asset 2'
)

asset_dropdown3 = widgets.Dropdown(
    options=sorted(sp500_combined['Symbol'].unique()),
    value='NEE',
    description='Asset 3'
)

# Create interactive widget
interactive_plot = widgets.interactive(update_plot, asset1=asset_dropdown1, asset2=asset_dropdown2, asset3=asset_dropdown3)



In [20]:
# Display the interactive widget
display(interactive_plot)

interactive(children=(Dropdown(description='Asset 1', index=322, options=('A', 'AAL', 'AAPL', 'ABBV', 'ABNB', …

In [21]:
# Define function to update plot based on selected assets
def update_plot(asset1, asset2):
    assets = [asset1, asset2]
    # Filter data for selected assets
    filtered_data = sp500_combined[sp500_combined['Symbol'].isin(assets)]
    # Pivot the dataframe to have dates as index and symbols as columns
    pf_data = filtered_data.pivot(index='Date', columns='Symbol', values='Adj Close')
    # Normalize the data
    normalized_data = (pf_data / pf_data.iloc[0] * 100)
    # Plot the normalized data
    normalized_data.plot(figsize=(10, 5))
    plt.title('Normalized Adjusted Prices')
    plt.ylabel('Normalized Price (Base = 100)')
    plt.xlabel('Date')
    plt.legend(title='Symbol')
    plt.show()

# Create dropdown widgets for asset selection
asset_dropdown1 = widgets.Dropdown(
    options=sorted(sp500_combined['Symbol'].unique()),
    value='NFLX',
    description='Asset 1'
)

asset_dropdown2 = widgets.Dropdown(
    options=sorted(sp500_combined['Symbol'].unique()),
    value='LYV',
    description='Asset 2'
)

# Create interactive widget
interactive_plot = widgets.interactive(update_plot, asset1=asset_dropdown1, asset2=asset_dropdown2)

# Display the interactive widget
display(interactive_plot)

interactive(children=(Dropdown(description='Asset 1', index=333, options=('A', 'AAL', 'AAPL', 'ABBV', 'ABNB', …

## Efficient Frontier

Now that we have selected the stocks for our portfolio, how should our funds be allocated among the three?

In [22]:
import numpy as np
import datetime as dt
import pandas as pds
from pandas_datareader import data as wb  # not aliased as `pd`, which would shadow pandas
import scipy.optimize as sc
import plotly.graph_objects as go

In [23]:
# Load the data; only the adjusted close of each stock is used below

df = pds.read_csv("sp500_combined.csv")

# Reshape so that each column is an asset and each value is its adjusted closing price

pivot_df = df.pivot_table(index='Date', columns='Symbol', values='Adj Close')

# Drop any rows (dates) with NaN values, keeping only dates on which every symbol has a price

stockdata_base = pivot_df.dropna()
stockdata_base.head()

Date,A,AAL,AAPL,ABBV,ABNB,ABT,ACGL,ACN,ADBE,ADI,...,WTW,WY,WYNN,XEL,XOM,XYL,YUM,ZBH,ZBRA,ZTS
2023-10-04,111.300636,12.73,172.975876,143.502319,127.410004,94.198936,80.459999,305.807007,518.419983,172.163712,...,207.2005,29.433392,87.745781,55.559395,108.590233,90.658012,122.336212,109.266174,230.940002,169.71167
2023-10-05,109.985001,12.85,174.220963,143.269104,124.989998,94.740585,81.639999,306.073853,516.440002,170.489532,...,207.180634,29.541639,89.019051,55.412228,106.145729,89.584656,120.662544,108.877678,222.539993,170.714127
2023-10-06,110.27404,12.76,176.790787,144.036713,126.360001,95.410263,82.18,308.574585,526.679993,172.342026,...,207.776596,29.344826,91.923683,56.265778,104.37323,90.409546,118.30555,110.481461,223.850006,174.267365
2023-10-09,110.911919,12.24,178.284882,144.882034,127.769997,95.292091,82.07,308.396637,529.289978,171.69812,...,207.637527,29.453074,92.580208,56.560108,108.025368,90.528801,117.760864,110.262314,222.580002,173.294693
2023-10-10,112.915276,12.26,177.687241,144.668274,131.589996,96.079948,81.860001,308.703064,532.719971,174.323318,...,207.061432,29.630205,93.495369,56.962357,107.567635,91.383514,119.206757,111.756508,222.399994,174.316986


In [24]:
# Select the assets to work with:
# 3 assets with low pairwise correlations for best diversification

stocklist = ["MSFT", "NEE", "CVX"]

# subset stockdata_base to the 3 assets
stockdata = stockdata_base[stocklist]


In [25]:
stockdata.head()

Date,MSFT,NEE,CVX
2023-10-04,317.154327,49.755985,159.607071
2023-10-05,317.552032,48.605957,160.448959
2023-10-06,325.407349,49.382477,158.814133
2023-10-09,327.95285,48.478176,163.209595
2023-10-10,326.530945,50.699604,163.033386


#### Why use log returns?

- Additive over time
- Symmetric around zero
- Closer to a normal distribution
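
The first two properties can be checked with a short example. This is a minimal sketch using only the standard library and a hypothetical price path that ends where it started; it does not attempt to demonstrate the third property, which concerns the shape of the return distribution.

```python
import math

# Hypothetical closing prices that end back at the starting level.
prices = [100.0, 110.0, 99.0, 100.0]

log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
simple_returns = [b / a - 1 for a, b in zip(prices, prices[1:])]

# Additive over time: daily log returns sum to the whole-period log return.
total_log_return = sum(log_returns)
whole_period = math.log(prices[-1] / prices[0])
print(math.isclose(total_log_return, whole_period, abs_tol=1e-12))

# Simple returns do not add up this way: the price is unchanged overall,
# yet the sum of the daily simple returns is not zero.
print(sum(simple_returns))

# Symmetric around zero: a move up and the exact reverse move down cancel.
up = math.log(110.0 / 100.0)
down = math.log(100.0 / 110.0)
print(math.isclose(up + down, 0.0, abs_tol=1e-12))
```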

We will use the SciPy package to minimize the negative Sharpe ratio, which is equivalent to maximizing the Sharpe ratio.

In [26]:
# calculate the returns

returns = np.log(stockdata / stockdata.shift(1)).dropna()
# returns = stockdata.pct_change().dropna()

# calculate meanReturns

meanReturns = returns.mean()

# calculate covMatrix

covMatrix = returns.cov()

In [27]:
#Final Version

# Function to calculate portfolio performance (daily basis)
def portfolioPerformance(weights, meanReturns, covMatrix, annualize=True):
    returns = np.sum(meanReturns * weights)
    std = np.sqrt(np.dot(weights.T, np.dot(covMatrix, weights)))
    if annualize:
        returns *= 252
        std *= np.sqrt(252)
    return returns, std

# Function to calculate negative Sharpe ratio
def negativeSR(weights, meanReturns, covMatrix, riskFreeRate=.03, annualize=True):
    pReturns, pStd = portfolioPerformance(weights, meanReturns, covMatrix, annualize)
    return - (pReturns - riskFreeRate) / pStd

# Function to maximize Sharpe ratio
def maxSR(meanReturns, covMatrix, riskFreeRate=.03, constraintSet=(0, 1), annualize=True):
    numAssets = len(meanReturns)
    args = (meanReturns, covMatrix, riskFreeRate, annualize)
    constraints = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1})
    bounds = tuple(constraintSet for asset in range(numAssets))
    result = sc.minimize(negativeSR, numAssets * [1. / numAssets], args=args,
                         method='SLSQP', bounds=bounds, constraints=constraints)
    return result

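# Function to calculate portfolio volatility. Despite its name, it returns the
# standard deviation; minimizing it also minimizes the variance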
def portfolioVariance(weights, meanReturns, covMatrix, annualize=True):
    return portfolioPerformance(weights, meanReturns, covMatrix, annualize)[1]

# Function to minimize portfolio variance
def minimizeVariance(meanReturns, covMatrix, constraintSet=(0, 1), annualize=True):
    numAssets = len(meanReturns)
    args = (meanReturns, covMatrix, annualize)
    constraints = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1})
    bounds = tuple(constraintSet for asset in range(numAssets))
    result = sc.minimize(portfolioVariance, numAssets * [1. / numAssets], args=args,
                         method='SLSQP', bounds=bounds, constraints=constraints)
    return result

# Function to calculate portfolio return
def portfolioReturn(weights, meanReturns, covMatrix, annualize=True):
    return portfolioPerformance(weights, meanReturns, covMatrix, annualize)[0]

# Function to optimize the portfolio for a target return
def efficientOpt(meanReturns, covMatrix, returnTarget, constraintSet=(0, 1), annualize=True):
    numAssets = len(meanReturns)
    args = (meanReturns, covMatrix, annualize)
    constraints = ({'type': 'eq', 'fun': lambda x: portfolioReturn(x, meanReturns, covMatrix, annualize) - returnTarget},
                   {'type': 'eq', 'fun': lambda x: np.sum(x) - 1})
    bounds = tuple(constraintSet for asset in range(numAssets))
    effOpt = sc.minimize(portfolioVariance, numAssets * [1. / numAssets], args=args,
                         method='SLSQP', bounds=bounds, constraints=constraints)
    return effOpt

# Function to calculate results for plotting
def calculatedResults(meanReturns, covMatrix, riskFreeRate=.03, constraintSet=(0, 1), annualize=True):
    maxSR_Portfolio = maxSR(meanReturns, covMatrix, riskFreeRate, constraintSet, annualize)
    maxSR_returns, maxSR_std = portfolioPerformance(maxSR_Portfolio['x'], meanReturns, covMatrix, annualize)
    maxSR_allocation = pds.DataFrame(maxSR_Portfolio['x'], index=meanReturns.index, columns=["allocation"])
    maxSR_allocation.allocation = [round(i * 100, 0) for i in maxSR_allocation.allocation]

    minVol_Portfolio = minimizeVariance(meanReturns, covMatrix, constraintSet, annualize)
    minVol_returns, minVol_std = portfolioPerformance(minVol_Portfolio['x'], meanReturns, covMatrix, annualize)
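    minVol_allocation = pds.DataFrame(minVol_Portfolio['x'], index=meanReturns.index, columns=["allocation"])
    # Express the allocation as rounded percentages, matching maxSR_allocation
    minVol_allocation.allocation = [round(i * 100, 0) for i in minVol_allocation.allocation]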

    efficientList = []
    targetReturns = np.linspace(minVol_returns, maxSR_returns, 100)
    efficientAllocations = []
    for target in targetReturns:
        opt_result = efficientOpt(meanReturns, covMatrix, target, constraintSet, annualize)
        efficientList.append(opt_result['fun'])
        efficientAllocations.append(opt_result['x'])

    return maxSR_returns, maxSR_std, maxSR_allocation, minVol_returns, minVol_std, minVol_allocation, efficientList, targetReturns, efficientAllocations

# Function to plot the efficient frontier
def EF_graph(meanReturns, covMatrix, riskFreeRate=.03, constraintSet=(0, 1), annualize=True):
    maxSR_returns, maxSR_std, maxSR_allocation, minVol_returns, minVol_std, minVol_allocation, efficientList, targetReturns, efficientAllocations = calculatedResults(meanReturns, covMatrix, riskFreeRate, constraintSet, annualize)

    # Max Sharpe Ratio trace
    MaxSharpeRatio = go.Scatter(
        name="Maximum Sharpe Ratio",
        mode="markers",
        x=[round(maxSR_std * 100, 2)],
        y=[round(maxSR_returns * 100, 2)],
        marker=dict(
            color="red",
            size=14,
            line=dict(
                width=3,
                color="black")),
        text=[f"Return: {round(maxSR_returns * 100, 2)}%, Volatility: {round(maxSR_std * 100, 2)}%<br>{maxSR_allocation.to_string()}"],
        hoverinfo='text'
    )

    # Min Volatility trace
    MinVol = go.Scatter(
        name="Minimum Volatility",
        mode="markers",
        x=[round(minVol_std * 100, 2)],
        y=[round(minVol_returns * 100, 2)],
        marker=dict(
            color="green",
            size=14,
            line=dict(
                width=3,
                color="black")),
        text=[f"Return: {round(minVol_returns * 100, 2)}%, Volatility: {round(minVol_std * 100, 2)}%<br>{minVol_allocation.to_string()}"],
        hoverinfo='text'
    )

    # Efficient Frontier trace
    efficient_text = [
        f"Return: {round(ret * 100, 2)}%, Volatility: {round(vol * 100, 2)}%<br>" +
        "<br>".join([f"{symbol}: {round(weight * 100, 2)}%" for symbol, weight in zip(meanReturns.index, alloc)])
        for ret, vol, alloc in zip(targetReturns, efficientList, efficientAllocations)
    ]

    EF_curve = go.Scatter(
        name="Efficient Frontier",
        mode="lines",
        x=[round(vol * 100, 2) for vol in efficientList],
        y=[round(ret * 100, 2) for ret in targetReturns],
        line=dict(color="black", width=3, dash="dashdot"),
        text=efficient_text,
        hoverinfo='text'
    )

    data = [MaxSharpeRatio, MinVol, EF_curve]

    layout = go.Layout(
        title='Portfolio Optimization with the Efficient Frontier',
        yaxis=dict(title='Annualized Return (%)'),
        xaxis=dict(title='Annualized Volatility (%)'),
        showlegend=True,
        legend=dict(
            x=0.75,
            y=0,
            traceorder='normal',
            bgcolor="#E2E2E2",
            bordercolor='black',
            borderwidth=2),
        width=800,
        height=600
    )

    fig = go.Figure(data=data, layout=layout)
    fig.show()

# Example usage
# Assuming meanReturns and covMatrix are defined
# meanReturns = pd.Series([...])  # Define your mean returns
# covMatrix = pd.DataFrame([...])  # Define your covariance matrix

Using SciPy's minimization routines, we minimize the negative Sharpe ratio, which is equivalent to maximizing the (positive) Sharpe ratio.
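As a sketch of that equivalence, write $w$ for the portfolio weights, $\mu$ for the mean returns, $\Sigma$ for the covariance matrix, and $r_f$ for the risk-free rate. These symbols are our own shorthand here, and the constraints assume a fully invested portfolio with the default `constraintSet=(0, 1)`:

$$
\max_{w} \; \frac{w^\top \mu - r_f}{\sqrt{w^\top \Sigma\, w}}
\quad\Longleftrightarrow\quad
\min_{w} \; -\frac{w^\top \mu - r_f}{\sqrt{w^\top \Sigma\, w}}
\qquad \text{subject to } \sum_i w_i = 1,\; 0 \le w_i \le 1
$$

Both problems are solved by the same weights; only the sign of the objective value differs.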

In portfolio diversification and risk management, log returns facilitate the measurement of risk and correlation between assets, which is crucial for diversifying a portfolio. Their properties also make it easier to apply risk management techniques such as Value at Risk (VaR) and stress testing.

We use log returns instead of percentage returns for three reasons.

First, log returns are additive over time: if you have two consecutive log returns, you can add them together to get the total log return for the combined period.

Second, they are symmetric around zero. A gain and a loss of the same log magnitude are equally far from zero and cancel exactly, which makes positive and negative performance easy to compare.

Third (and this is more of an assumption than a guarantee), log returns approximately follow a normal distribution, which makes them convenient to analyze and model with statistical methods.
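The additivity and symmetry properties can be checked with a few lines of Python. This is a minimal sketch using made-up prices, not values from our dataset:

```python
import math

# Illustrative prices (hypothetical, not from our dataset)
prices = [100.0, 110.0, 99.0, 105.0]

# Log return for each step: ln(P_t / P_{t-1})
log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]

# Additivity: the per-step log returns sum to the log return over the whole period
total_log_return = math.log(prices[-1] / prices[0])
assert math.isclose(sum(log_returns), total_log_return)

# Symmetry: a log return of +r followed by -r brings the price back to its start
r = 0.10
log_round_trip = 100.0 * math.exp(r) * math.exp(-r)
assert math.isclose(log_round_trip, 100.0)

# Percentage returns lack this property: +10% then -10% does not return to 100
pct_round_trip = 100.0 * (1 + 0.10) * (1 - 0.10)
print(round(pct_round_trip, 2))
```

The percentage round trip ends slightly below the starting price, while the log-return round trip ends exactly where it began.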

The risk-free rate is a key parameter in optimizing portfolios, as covered in Syd's intro. We set `riskFreeRate=.03` because 3% is the standard rate we expect a US Treasury Bill to return, and we treat that return as having zero volatility (standard deviation).

In [28]:
EF_graph(meanReturns, covMatrix)


# Predictive Models - Stock Prices

The most important aspect of Modern Portfolio Theory is diversification: not just across the types of assets in your portfolio, but also in the weights given to each stock. Having used pairwise correlation to select the sectors to invest in and the efficient frontier to decide how much of the starting investment goes into each asset, we can forecast the value of the portfolio using various machine learning models.


### Assumptions

 - Stocks: Microsoft (MSFT), NextEra Energy (NEE), and Chevron (CVX)

 - Risk-free asset: US Treasury Bill at 3%

 - Risk-averse portfolio: 46.04% return and 14.07% volatility

 - Weights: MSFT 60.78%, NEE 20.99%, CVX 18.23%

 - Initial investment: 100 USD, split as MSFT \\$60.78, NEE \\$20.99, CVX \\$18.23

 - Initial value of the investment in the risk-free asset: \\$25

 - The total investment is made on 1st January 2024.
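The dollar amounts follow directly from the weights and the 100 USD initial investment. A short sketch of that arithmetic, using the figures listed above (the variable names are ours):

```python
initial_investment = 100.0  # USD, from the assumptions above

# Portfolio weights from the assumptions above, expressed as fractions
weights = {"MSFT": 0.6078, "NEE": 0.2099, "CVX": 0.1823}

# The weights should account for the whole investment
assert abs(sum(weights.values()) - 1.0) < 1e-9

# Dollar amount placed in each stock
dollars = {ticker: round(w * initial_investment, 2) for ticker, w in weights.items()}
print(dollars)
```

The same calculation scales to any starting amount by changing `initial_investment`.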

### Random Forest

Using Random Forest, a machine learning regressor, to predict the stock price of MSFT yielded the following:

<img src="image 1.png" alt="Sectors" style="float: left; padding:10px; max-width: 600px; max-height: 600px; margin-right: 20px;">

### LSTM - Long Short Term Memory

Long Short-Term Memory (LSTM) is a deep learning model: a type of recurrent neural network designed to deal with the vanishing gradient problem that affects standard recurrent networks. The model is able to keep up with vast amounts of data because it only analyses one fixed-length window of time at a time and carries that analysis forward as input to the next window.

In our LSTM, we use a time step (the span of historical data the model looks back over) of 100 days.
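To illustrate what a 100-day time step means for the training data, here is a minimal sketch that slices a price series into look-back windows. It assumes the time step is a look-back window, and `make_windows` is a hypothetical helper for illustration, not the function used in our notebook:

```python
def make_windows(series, time_step):
    """Split a price series into (look-back window, next value) training pairs."""
    X, y = [], []
    for i in range(len(series) - time_step):
        X.append(series[i:i + time_step])  # the previous `time_step` values
        y.append(series[i + time_step])    # the value the model should predict
    return X, y

# Hypothetical series of 105 closing prices: 0.0, 1.0, ..., 104.0
prices = [float(p) for p in range(105)]
X, y = make_windows(prices, time_step=100)

print(len(X), len(X[0]), y[0])
```

Each row of `X` holds 100 consecutive prices and the matching entry of `y` is the price on the following day, so a series of 105 prices yields 5 training pairs. Deep learning frameworks typically expect these windows reshaped into a 3-D array of (samples, time steps, features) before they are fed to an LSTM layer.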

The following slides show the adjusted closing prices of our three stocks over 10 years:

#### MSFT

<img src="image 2.png" alt="Sectors" style="float: left; padding:10px; max-width: 600px; max-height: 600px; margin-right: 20px;">

#### NEE

<img src="image 3.png" alt="Sectors" style="float: left; padding:10px; max-width: 600px; max-height: 600px; margin-right: 20px;">

#### CVX

<img src="image 4.png" alt="Sectors" style="float: left; padding:10px; max-width: 600px; max-height: 600px; margin-right: 20px;">

#### MSFT: 100 days past 11th April 

<img src="image 5.png" alt="Sectors" style="float: left; padding:10px; max-width: 600px; max-height: 600px; margin-right: 20px;">

Predicted: **\\$397.39**

Actual: **\\$427.93**



#### NEE: 100 days past 11th April 

<img src="image 6.png" alt="Sectors" style="float: left; padding:10px; max-width: 600px; max-height: 600px; margin-right: 20px;">

Predicted: **\\$62.40**

Actual: **\\$63.94**



#### CVX: 100 days past 11th April 

<img src="image 7.png" alt="Sectors" style="float: left; padding:10px; max-width: 600px; max-height: 600px; margin-right: 20px;">

Predicted: **\\$155.59**

Actual: **\\$160.27**
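One simple way to compare the three LSTM forecasts is the percentage error of each prediction relative to the actual price. The sketch below uses only the predicted and actual values reported above; the choice of metric is ours:

```python
# (predicted, actual) prices in USD from the LSTM results above
results = {
    "MSFT": (397.39, 427.93),
    "NEE": (62.40, 63.94),
    "CVX": (155.59, 160.27),
}

# Signed percentage error: negative means the model under-predicted
errors = {
    ticker: (predicted - actual) / actual * 100
    for ticker, (predicted, actual) in results.items()
}

for ticker, pct_error in errors.items():
    print(f"{ticker}: {pct_error:+.2f}%")
```

All three errors are negative, meaning the model under-predicted each price, with MSFT showing the largest gap.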

#### 21.6\% Return after 100 days
<img src="image 8.png" alt="Sectors" style="float: left; padding:10px; max-height: 400px; margin-right: 20px;">

<img src="image 9.png" alt="Sectors" style="float: left; padding:10px; max-height: 700px; margin-right: 20px;">

### Random Walk


The LSTM is a great option for short-term predictions, where its forecasts stay accurate. For longer-term predictions from historical stock data, we can use the Random Walk model.

 - Random Walk
 
 - Stochastic Process
 
 - Volatility
 
 - Drift
 
 - Stochastic Methods for Predictions
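To make the terms above concrete, here is a minimal sketch of a random walk with drift, where each day's log return is a constant drift plus volatility-scaled Gaussian noise. This is one common formulation and may differ from the exact model behind our plots; `simulate_random_walk` and its inputs are hypothetical:

```python
import math
import random

def simulate_random_walk(start_price, drift, volatility, n_days, seed=None):
    """Simulate one price path whose daily log returns are drift plus Gaussian noise."""
    rng = random.Random(seed)
    prices = [start_price]
    for _ in range(n_days):
        # Stochastic step: deterministic drift plus a random shock scaled by volatility
        log_return = drift + volatility * rng.gauss(0.0, 1.0)
        prices.append(prices[-1] * math.exp(log_return))
    return prices

# Hypothetical inputs: in practice, drift and volatility could be estimated as the
# mean and standard deviation of historical daily log returns
path = simulate_random_walk(start_price=100.0, drift=0.0005, volatility=0.01, n_days=252, seed=42)
print(len(path), round(path[0], 2))
```

Because each path is random, a forecast usually comes from simulating many paths and summarizing them, for example by their average or a range of percentiles.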

<img src="image 10.png" alt="Sectors" style="float: left; padding:10px; max-height: 700px; margin-right: 20px;">

#### MSFT Random Walk Predictions

<img src="image 11.png" alt="Sectors" style="float: center; padding:10px; max-height: 400px; margin-right: 20px;">


Jan 2028 prediction: 605.48 

April 2024 prediction: 427.81 

11th April Actual: 427.93

<img src="image 12.png" alt="Sectors" style="float: left; padding:10px; max-height: 400px; margin-right: 20px;">


#### NEE Random Walk Predictions

<img src="image 13.png" alt="Sectors" style="float: center; padding:10px; max-height: 400px; margin-right: 20px;">

Jan 2028 prediction: 61.73 

Apr 2024 prediction: 63.82 

11th April Actual:  63.94

<img src="image 14.png" alt="Sectors" style="float: left; padding:10px; max-height: 400px; margin-right: 20px;">


#### CVX Random Walk Predictions

<img src="image 15.png" alt="Sectors" style="float: left; padding:10px; max-height: 400px; margin-right: 20px;">

Jan 2028 prediction: 106.83 

Apr 2024 prediction: 157.051 

11th April Actual:  160.27

<img src="image 16.png" alt="Sectors" style="float: left; padding:10px; max-height: 400px; margin-right: 20px;">


<img src="image 17.png" alt="Sectors" style="float: left; padding:10px; max-height: 400px; margin-right: 20px;">


# Next Steps

In the next phase of our project:

We want the everyday user to be able to use this tool to build a portfolio from any number of assets and calculate the weights for the most efficient allocation, based on their expected rate of return and the associated volatility. 

We want to deliver a more accurate model by including regressors such as the unemployment rate, GDP, and inflation rate to capture their effects on the stock market. 

We want to explore combining language processing with predictive models to see how users' behaviours, sentiment, and profiles affect their investment choices. 

<img src="QA.webp" alt="Sectors" style="float: center; padding:10px; max-height: 600px; margin-right: 20px;">
