# Stock and ETFs analysis and strategy

#### Author: Matteo Caorsi

The goal of this simple notebook is to go through a few very basic analyses of how and what to invest in.

There are some **assumptions** made in this model:
 1. We do not like risk, and thus we prefer not losing money to having a chance of gaining a lot on a bet
 2. We do not do high-frequency trading (HFT). We use long-term strategies that do not require constant surveillance of the stock market
 3. The future will have similar stochastic properties as the past; note that we do not imply that financial markets are stationary (they clearly are not), but rather that the global market will keep growing in the long run
 
The analysis is divided into two main sections:
 1. Stock market analysis and simulations
 2. ETFs analysis and simulations
 


In [None]:
## importing the main tools
# type hinting
from numbers import Number
from typing import Tuple, Callable, List

# data
from get_all_tickers import get_tickers as gt
import numpy as np
import yfinance as yf
import pandas as pd
from tqdm import tqdm
import os.path
import random

#utils
from utils import *

# linear regression
from sklearn.linear_model import LinearRegression

# plot
import plotly.express as px
import plotly.figure_factory as ff

# optimisation
from scipy.optimize import minimize
from scipy.optimize import LinearConstraint

## Basic manipulation

We use `yfinance` to get the data, and we store it in CSV files for convenience.
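The caching idea can be sketched as follows. This is only a minimal illustration, not the notebook's own `load_or_store_data` (which lives in `utils` and is not shown here); the name `load_or_fetch` and its `fetch` argument are hypothetical.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_fetch(path: str, fetch: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the CSV cached at `path` if it exists; otherwise call `fetch` and cache it.

    `fetch` is a hypothetical zero-argument callable that downloads the data.
    """
    csv_path = Path(path)
    if csv_path.exists():
        # cache hit: no download needed
        return pd.read_csv(csv_path, index_col=0, parse_dates=True)
    df = fetch()
    df.to_csv(csv_path)  # cache miss: store for next time
    return df
```

With `yfinance`, the `fetch` argument could be something like `lambda: yf.Ticker("ENI.MI").history(period="max")`, which requires network access.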

In [None]:
# example of manipulations
stock_name = 'ENI.MI'
# Value
stock = yf.Ticker(stock_name)
stock_info = stock.fast_info
# print(stock_info.keys())
last_price = stock_info['last_price']
sector = stock.info['sector']
print("market sector: ",sector)
previous_close_price = stock_info['regular_market_previous_close']
print('market price ', last_price)
print('previous close price ', previous_close_price)
in_stmt = stock.income_stmt
in_stmt

Here we display the time series of the dividends of the stock `PXD` over the last 20+ years.

In [None]:
show_dividends_online("PXD")

Here we display the time series of the price of the stock `PXD` over the last 20+ years.

In [None]:
show_prices_online("PXD")

## Looking for Stocks and funds

Let's start by getting the largest 400 public companies in the US market

In [None]:
# let's extract the 400 tickers
names = get_tickers("stocks_tickers_sp500.csv")

In [None]:
# get the macro data for each stock, like the ROE of the last year, the EBITDA, the industry sector
data_df = get_macro_data(names)

In [None]:
# 2D interactive plot
px.scatter(data_df[data_df["Value"] > 0], x="ROE", y="EBITDA", size="Value", color = "Sector",
           hover_data=["Name"], width=1000, height=800)

In [None]:
# get the largest stocks (top 9) per industry sector
selected_data_df, sectors = select_data_per_sector(data_df)
selected_data_df

In [None]:
names = selected_data_df["Name"].values.tolist()
print(names)

## Load or store data for selected stocks

In [None]:
# let's get the price and dividends time series for the selected stocks
df_p, df_d = load_or_store_data(names)
df_p.head(10)

## Training and validation data

In order to avoid data leakage, we split the time frame we have (almost 40 years!) into six consecutive intervals of roughly the same duration: $40/6 \approx 7$ years each.

This splitting will be used to calibrate the return and correlation matrix (on split $n$ say) and then to make the simulations of the portfolio in the following split ($n+1$).
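A minimal sketch of how such train/validation pairs could be built from row positions is given below. The function `make_splits` is hypothetical and is not the notebook's `pick_a_split`; it also assumes about 9300 rows of daily data, whereas the hard-coded boundaries used later in this notebook additionally leave a small gap between training and validation.

```python
from typing import List, Tuple

# ((train_start, train_end), (validation_start, validation_end)), as row positions
Split = Tuple[Tuple[int, int], Tuple[int, int]]


def make_splits(n_rows: int, n_intervals: int = 6) -> List[Split]:
    """Cut the rows into consecutive intervals and pair each one with the next."""
    size = n_rows // n_intervals
    bounds = [(k * size, (k + 1) * size) for k in range(n_intervals)]
    # interval n is used for calibration, interval n+1 for simulation
    return [(bounds[k], bounds[k + 1]) for k in range(n_intervals - 1)]


splits_sketch = make_splits(9300)
print(splits_sketch[0])  # ((0, 1550), (1550, 3100))
```

Because validation always starts where training ends, no future price is visible during calibration.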

In [None]:
# picking one split (7 years for training and 7 following years for validation)
splits = pick_a_split(df_p)

## Portfolio Optimisation

In this section we want to define the optimisation function. The function we use is rather standard and was introduced by Markowitz in 1952.

In short, we want to distribute our wealth to both minimise the risk and maximise the gain of our portfolio.

Let's go through the two sides then:
 1. **gain** can be both the growth of the stock price $g_p$ (defined as the slope of the regression line) and its dividends $g_d = \sum_t d_t$
 2. **risk** is the volatility of a single stock (i.e. its standard deviation). However, we also consider the correlations between the price time series of different stocks. In particular, we would like to avoid picking two highly correlated stocks and putting a lot of money on both, as we would not be well protected if one of the two falls. The natural tool is the covariance matrix (here normalised by the standard deviations $\sigma_i \sigma_j$), whose contribution to the portfolio we minimise:
 $$ C_{ij} = \frac{\sum_t(p_t^i-\mu^i)(p_t^j-\mu^j)}{\sigma_i \sigma_j}$$
 
Given that $\bf w$ is the weight vector (i.e. $\sum_i w_i=1$ and $w_i \ge 0, \forall i$), the formula of the optimisation function is:

$$ \mathcal L({\bf w}) = -{\bf w} \cdot g_p - {\bf w} \cdot g_d + 2 {\bf w}^T \cdot C \cdot {\bf w} $$

The minimisation problem is a constrained minimisation, formulated as:
$$\min_{\bf w} \mathcal L({\bf w}) \quad \text{subject to} \quad \sum_i w_i=1, \quad w_i \ge 0 \;\; \forall i$$

The index $i$ runs through the available stocks.
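As an illustration, the constrained problem can be solved with `scipy.optimize.minimize`. This is a sketch under assumptions, not the `optimise` helper from `utils`: the names `loss` and `optimise_sketch` are hypothetical, the gain terms are assumed to enter with a minus sign (so that minimising the loss rewards gain), and the SLSQP method is chosen here only because it accepts both bounds and an equality constraint.

```python
import numpy as np
from scipy.optimize import minimize


def loss(w, cov_mat, gain, gain_div):
    # Assumption: gains are subtracted, the risk term 2 * w^T C w is added
    return -w @ gain - w @ gain_div + 2.0 * w @ cov_mat @ w


def optimise_sketch(cov_mat, gain, gain_div):
    n = len(gain)
    w0 = np.full(n, 1.0 / n)  # start from equal weights
    res = minimize(
        loss,
        w0,
        args=(cov_mat, gain, gain_div),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n,  # w_i >= 0
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],  # sum_i w_i = 1
    )
    return res.x


toy_cov = np.array([[1.0, -1.0], [-1.0, 2.0]])
toy_gain = np.array([0.5, 1.0])
toy_div = np.array([0.01, 0.011])
w_opt = optimise_sketch(toy_cov, toy_gain, toy_div)
print(w_opt)
```

On this toy input the optimum lies strictly inside the simplex, so both assets receive a non-zero weight.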

In [None]:
# unit test
cov_mat: np.ndarray = np.array([[1,-1],[-1,2]])  # second is riskier, negatively correlated with the first
gain: np.ndarray = np.array([0.5,1]) # second gains more
gain_div: np.ndarray = np.array([0.01,0.011]) # second pays slightly more dividends
# optimisation problem (following the Markowitz model)
optimise(portfolio_gain_risk_no_constr, cov_mat, gain, gain_div)

## Select a few stocks and for each get the gain and the overall covariance matrix

We will now first compute the gains and covariance matrices of some selected stocks (the largest companies per sector), and then use that data to calibrate the loss function $\mathcal L({\bf w})$ and obtain the portfolio distribution of wealth ${\bf w}$.
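The quantities involved can be sketched as below. This is not the code of `compute_gains` or `compute_cov_mat`; the helper names are hypothetical, the data frames are assumed to have one column per ticker and one row per trading day without missing values, and `np.polyfit` is used in place of `LinearRegression` for brevity.

```python
import numpy as np
import pandas as pd


def price_slope(prices: pd.Series) -> float:
    """Slope of the least-squares line through the prices, per time step."""
    t = np.arange(len(prices))
    slope, _intercept = np.polyfit(t, prices.to_numpy(dtype=float), deg=1)
    return float(slope)


def gains_and_cov(df_prices: pd.DataFrame, df_divs: pd.DataFrame, train: tuple):
    """Price gain, dividend gain and covariance matrix on the training rows only."""
    start, end = train
    p = df_prices.iloc[start:end]
    d = df_divs.iloc[start:end]
    gain = p.apply(price_slope).to_numpy()  # g_p: regression slope per ticker
    gain_div = d.sum().to_numpy()  # g_d: sum of the dividends per ticker
    cov = np.cov(p.to_numpy(dtype=float), rowvar=False)  # columns are variables
    return gain, gain_div, cov


toy_p = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "B": [4.0, 3.0, 2.0, 1.0]})
toy_d = pd.DataFrame({"A": [0.0, 0.1, 0.0, 0.1], "B": [0.0, 0.0, 0.0, 0.0]})
toy_gain, toy_gain_div, toy_cov_mat = gains_and_cov(toy_p, toy_d, (0, 4))
```

In the toy data, ticker `A` rises by one unit per step while `B` falls by the same amount, so the off-diagonal covariance is negative.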

In [None]:
# starts from the ticker names selected above

gain_div, gain = compute_gains(df_p, df_d, names, splits)

In [None]:
# covariance matrix
cov_mat = compute_cov_mat(df_p, names, splits)

In [None]:
gain_all = gain
gain_div_all = gain_div
cov_mat_all = cov_mat
# display the covariance matrix
fig = px.imshow(cov_mat_all, x=names, y=names, height=800, width=800)
fig.show()

### Portfolio optimisation specific to each sector

We want to run our constrained optimisation on subsets of stocks. To impose diversification on our portfolio, we optimise within each sector and simply pick, per sector, the stock that receives the largest weight.
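The per-sector selection can be sketched as follows. The function `pick_per_sector` and its arguments are hypothetical and do not reproduce the signature of `optimise_per_sector`; `optimise_fn` stands for any routine returning one weight per stock, and the toy rule used in the example (weights proportional to gain) is only there to keep the sketch self-contained.

```python
from typing import Callable, List, Sequence

import numpy as np


def pick_per_sector(
    names: Sequence[str],
    sector_of: Sequence[str],
    cov_mat: np.ndarray,
    gain: np.ndarray,
    gain_div: np.ndarray,
    optimise_fn: Callable[[np.ndarray, np.ndarray, np.ndarray], np.ndarray],
) -> List[str]:
    """For each sector, optimise on that sector's stocks and keep the top-weighted one."""
    picks = []
    for sector in sorted(set(sector_of)):
        idx = [i for i, s in enumerate(sector_of) if s == sector]
        sub_cov = cov_mat[np.ix_(idx, idx)]  # covariance restricted to the sector
        w = optimise_fn(sub_cov, gain[idx], gain_div[idx])
        picks.append(names[idx[int(np.argmax(w))]])
    return picks


toy_names = ["A", "B", "C", "D"]
toy_sectors = ["Energy", "Energy", "Tech", "Tech"]
toy_picks = pick_per_sector(
    toy_names,
    toy_sectors,
    np.eye(4),
    np.array([1.0, 2.0, 3.0, 1.0]),
    np.zeros(4),
    lambda c, g, d: g / g.sum(),  # toy rule, not a real optimiser
)
print(toy_picks)  # ['B', 'C']
```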

In [None]:

best_picks = optimise_per_sector(portfolio_gain_risk_no_constr_with_div, cov_mat_all, gain_all, 
                                 gain_div_all, sectors, selected_data_df, names)


In [None]:
# these are the best picks per sector
best_picks

## Strategy 1: accumulation plan

The first strategy we are going to simulate is the accumulation plan. In short, every 22 working days (roughly a month) we buy a small amount of stock (300 USD), for 36 months. The hope is to average out the volatility.
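A minimal simulation of such a plan could look like the sketch below. It is not the `accumulate` function from `utils`: the name `accumulate_sketch`, the use of fractional shares and the choice of valuing the position at the last available price are all assumptions.

```python
import numpy as np


def accumulate_sketch(prices, n_months: int = 36, amount: float = 300.0, step: int = 22):
    """Buy `amount` USD of stock every `step` rows, `n_months` times.

    Returns (invested, final_value, roi), valued at the last price of the series.
    """
    prices = np.asarray(prices, dtype=float)
    buy_days = np.arange(n_months) * step
    if buy_days[-1] >= len(prices):
        raise ValueError("price series too short for the requested plan")
    shares = np.sum(amount / prices[buy_days])  # fractional shares allowed
    invested = amount * n_months
    final_value = shares * prices[-1]
    roi = final_value / invested - 1.0
    return invested, final_value, roi


flat_then_double = np.full(800, 10.0)
flat_then_double[-1] = 20.0  # the price doubles on the last day only
invested, final_value, roi = accumulate_sketch(flat_then_double)
print(invested, final_value, roi)  # 10800.0 21600.0 1.0
```

Since each purchase spends a fixed amount, more shares are bought when the price is low, which is the averaging effect the strategy relies on.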

In [None]:
# let's compute the ROIs on the price increase for a portfolio in which we
# put an equal amount of money into each of the top 9 companies per sector

rois = []
for name in tqdm(names):
    _,_, roi = accumulate(df_p,36,300,name,splits)
    rois.append(roi)

# average gain of the portfolio
print("Mean ROI:",np.mean(rois))
px.histogram(rois)

In [None]:
# the results in the validation split
list(zip(rois,names))

In [None]:
# let's compute the ROIs on the price increase for a portfolio in which we
# put an equal amount of money into each of the best picks of the previous step

rois = []
for name in tqdm(best_picks):
    _,_, roi = accumulate(df_p,36,300,name,splits)
    rois.append(roi)

print("Mean ROI:",np.mean(rois))
px.histogram(rois)

In [None]:
# results for the best picks in the validation split
list(zip(rois,best_picks))

## Strategy 2: dividend reinvestment

This second strategy focuses on collecting dividends and re-investing them in the same stocks, over and over.
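The mechanism can be sketched as follows. This is not the `dividend_reinvest` function from `utils`; the name `dividend_reinvest_sketch` is hypothetical, and the sketch assumes that the numeric argument is an initial investment in USD, that dividends are expressed per share, and that taxes and fees are ignored.

```python
import numpy as np


def dividend_reinvest_sketch(prices, dividends, initial: float = 100.0) -> float:
    """Invest `initial` USD on day 0 and reinvest every dividend; return the ROI."""
    prices = np.asarray(prices, dtype=float)
    dividends = np.asarray(dividends, dtype=float)
    shares = initial / prices[0]
    for price, div in zip(prices, dividends):
        if div > 0:
            # the cash dividend buys additional shares at that day's price
            shares += shares * div / price
    return shares * prices[-1] / initial - 1.0


toy_roi = dividend_reinvest_sketch([10.0, 10.0, 10.0, 10.0], [0.0, 1.0, 0.0, 1.0])
print(toy_roi)
```

In the toy example the price is flat, so the whole return comes from the two dividends, and the second dividend is paid on a larger number of shares than the first.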

In [None]:
# re-optimise taking only dividends into account

best_picks = optimise_per_sector(portfolio_gain_risk_no_constr_only_div, cov_mat_all, gain_all, 
                                 gain_div_all, sectors, selected_data_df, names)

In [None]:
# let's compute the ROIs using this second strategy

rois = []
for name in tqdm(best_picks):
    roi = dividend_reinvest(df_p, df_d, 100, name,splits)
    rois.append(roi)

print("Mean ROI:",np.mean(rois))
px.histogram(rois)

In [None]:
# results for the dividends strategy on the best picks
list(zip(rois,best_picks))

## Benchmarking over all splits

We now run a more extensive analysis on all splits to analyse how the portfolio changes over time and also how the ROIs change.

Since the dividend reinvestment strategy seems more profitable, we will use only that one.

In [None]:
splits_list = [((0, 1450), (1550, 3099)), 
               ((1550, 3000), (3100, 4649)), 
               ((3100, 4550), (4650, 6199)), 
               ((4650, 6100), (6200, 7749)),
               ((6200, 7650), (7750, 9299)), 
              ]

output_print = []

for splits in splits_list:
    gain_div, gain = compute_gains(df_p, df_d, names,splits)
    cov_mat = compute_cov_mat(df_p, names,splits)

    gain_all = gain
    gain_div_all = gain_div
    cov_mat_all = cov_mat

    best_picks = optimise_per_sector(portfolio_gain_risk_no_constr_with_div, cov_mat_all, gain_all, 
                                     gain_div_all, sectors, selected_data_df, names)

    rois = []
    for name in tqdm(best_picks):
        roi = dividend_reinvest(df_p, df_d, 100, name, splits)
        rois.append(roi)

    output_print.append("Mean ROI: " + str(np.mean(rois)) + " for splits: " + str(splits) + " results: " + str(list(zip(rois, best_picks))))


In [None]:
## results of the benchmark
print("\n\n".join(output_print))

## ETFs analysis

Instead of considering stocks, we will now focus on ETFs, and only on the 100 largest ones.

As with stocks, we first select the top 9 ETFs per sector (here, the sector is the one in which most of the ETF's holdings fall) and optimise our portfolio using the Markowitz model.

In [None]:
# get etfs tickers
names = get_tickers("etfs.csv")
print(names)


In [None]:
df_etfs = get_macro_etfs(names)

In [None]:
# select best ETF in terms of market cap, per sector
data_df, sectors = add_sector_to_etfs(df_etfs)
selected_data_df, sectors = select_data_per_sector(data_df)
names = selected_data_df["Name"].values.tolist()

In [None]:
# get the ETFs time series for prices and dividends
df_p, df_d = load_or_store_data(names)
df_p.head(10)

In [None]:
## benchmark etfs over all time splits
splits_list = [((0, 1450), (1550, 3099)), 
               ((1550, 3000), (3100, 4649)), 
               ((3100, 4550), (4650, 6199)), 
               ((4650, 6100), (6200, 7749)),
               ((6200, 7650), (7750, 9299)), 
              ]

output_print = []

for splits in splits_list:
    gain_div, gain = compute_gains(df_p, df_d, names,splits)
    cov_mat = compute_cov_mat(df_p, names, splits)

    gain_all = gain
    gain_div_all = gain_div
    cov_mat_all = cov_mat
    try:
        best_picks = optimise_per_sector(portfolio_gain_risk_no_constr_with_div, cov_mat_all, gain_all, 
                                         gain_div_all, sectors, selected_data_df, names)

        rois = []
        for name in tqdm(best_picks):
            roi = dividend_reinvest(df_p, df_d, 100, name, splits)
            rois.append(roi)

        output_print.append("Mean ROI: " + str(np.mean(rois)) + " for splits: " + str(splits) + " results: " + str(list(zip(rois, best_picks))))
    except ValueError:
        print("The split does not contain meaningful data (probably too old). Skipping...")

In [None]:
# results of the ETFs benchmarking
print("\n\n".join(output_print))