# A product basket optimization approach using Markowitz Portfolio Theory on Vinted dataset

### Introduction to Modern Portfolio Theory

Modern Portfolio Theory (MPT), introduced by Harry Markowitz in 1952, is a mathematical framework for constructing efficient investment portfolios. MPT is based on the idea that investors can construct portfolios that optimize expected return while minimizing risk.

The efficient frontier represents the set of portfolios that offer the highest expected return for a given level of risk, or the lowest risk for a given level of expected return.
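As a rough illustration of this definition, the sketch below marks which portfolios in a small sample are not dominated by any other. The function name `efficient_frontier_mask` and the toy numbers are assumptions made for this example; they do not come from the dataset.

```python
import numpy as np

def efficient_frontier_mask(risk, ret):
    """Return a boolean mask of portfolios not dominated by any other.

    A portfolio is kept only if its return is strictly higher than that of
    every portfolio with lower or equal risk.
    """
    risk = np.asarray(risk, dtype=float)
    ret = np.asarray(ret, dtype=float)
    # Sort by ascending risk; among equal risks, put the highest return first.
    order = np.lexsort((-ret, risk))
    mask = np.zeros(risk.shape[0], dtype=bool)
    best_return = -np.inf
    for idx in order:
        # Keep a portfolio only if it beats everything seen at lower risk.
        if ret[idx] > best_return:
            mask[idx] = True
            best_return = ret[idx]
    return mask

# Hypothetical sample of five portfolios (risk, return).
risk = np.array([1.0, 2.0, 2.0, 3.0, 4.0])
ret = np.array([1.0, 3.0, 2.5, 2.0, 5.0])
frontier = efficient_frontier_mask(risk, ret)
print(frontier)
```

The same mask could be applied to a table of randomly sampled portfolios to highlight the empirical frontier in a risk/return scatter plot.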

### Key Concepts

- **Defining expected return**: The expected return of a portfolio is the weighted average of the expected returns of its individual assets.

$E(R_p) = \sum_{i=1}^{n} w_i \cdot E(R_i)$

Where:
- $E(R_p)$ is the expected return of the portfolio.
- $w_i$ is the weight of asset $i$ in the portfolio.
- $E(R_i)$ is the expected return of asset $i$.

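- **Defining risk**: MPT uses the variance of the portfolio's returns, $\sigma_p^2 = \mathbf{w}^T \mathbf{\Sigma} \mathbf{w}$, as its proxy for risk, where $\mathbf{w}$ is the vector of weights and $\mathbf{\Sigma}$ is the covariance matrix of asset returns.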

- **Diversification**: One of the key principles of MPT is diversification, which involves spreading investments across different asset classes with uncorrelated or negatively correlated returns.
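The three ideas above can be checked on a toy two-asset portfolio. This is a minimal sketch: the helper names `portfolio_stats` and `two_asset_cov` and all numbers are made up for illustration. It shows that the expected return stays the same while the portfolio variance falls as the correlation between the two assets decreases.

```python
import numpy as np

def portfolio_stats(weights, expected_returns, cov):
    """Expected return (weighted average) and variance (w^T Sigma w)."""
    weights = np.asarray(weights, dtype=float)
    exp_ret = float(weights @ np.asarray(expected_returns, dtype=float))
    variance = float(weights @ np.asarray(cov, dtype=float) @ weights)
    return exp_ret, variance

def two_asset_cov(sigma_1, sigma_2, rho):
    """Covariance matrix of two assets with correlation rho."""
    off_diagonal = rho * sigma_1 * sigma_2
    return np.array([[sigma_1 ** 2, off_diagonal],
                     [off_diagonal, sigma_2 ** 2]])

weights = [0.5, 0.5]
expected_returns = [10.0, 20.0]

results = {}
for rho in (1.0, 0.0, -1.0):
    cov = two_asset_cov(2.0, 2.0, rho)
    results[rho] = portfolio_stats(weights, expected_returns, cov)
    print(f"rho={rho:+.1f}  expected return={results[rho][0]:.1f}  "
          f"variance={results[rho][1]:.1f}")
```

With perfectly correlated assets the portfolio is as risky as each asset on its own; with perfectly negatively correlated assets of equal volatility, the equal-weight portfolio has zero variance.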

### Mathematical Formulation

The optimization problem in MPT can be formulated as a quadratic programming problem: find the portfolio weights that maximize the expected return for a given level of risk, or that minimize the risk for a given level of expected return. The weights are subject to constraints such as a budget constraint and minimum or maximum weight bounds.

$\text{Maximize} \quad E(R_p) = \mathbf{w}^T \mathbf{R}$

$
\text{Subject to:} \quad
\begin{cases}
\mathbf{w}^T \mathbf{1} = 1 & \text{(Budget constraint)} \\
\mathbf{w}^T \mathbf{\Sigma} \mathbf{w} \leq \sigma^2 & \text{(Risk constraint)} \\
w_i \geq 0 & \text{(Non-negativity constraint)}
\end{cases}
$

Where:
- $E(R_p)$ is the expected return of the portfolio.
- $\mathbf{w}$ is the vector of portfolio weights.
- $\mathbf{R}$ is the vector of expected returns of the assets.
- $\mathbf{\Sigma}$ is the covariance matrix of asset returns.
- $\sigma^2$ is the target risk level.
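To make the formulation concrete, here is a minimal sketch that solves it for three assets by brute force over a grid of weights. This is not a quadratic programming solver; it is an exhaustive search swapped in to keep the example dependency-free. The function name `grid_search_portfolio`, the `step` parameter, and the input numbers are assumptions made for illustration.

```python
import numpy as np

def grid_search_portfolio(expected_returns, cov, max_variance, step=0.05):
    """Maximize w^T R subject to sum(w) = 1, w >= 0, w^T Sigma w <= max_variance.

    Works for exactly three assets; weights are restricted to multiples of step.
    Returns (None, -inf) if no grid point satisfies the risk constraint.
    """
    expected_returns = np.asarray(expected_returns, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n_steps = int(round(1.0 / step))
    best_weights, best_return = None, -np.inf
    for i in range(n_steps + 1):
        for j in range(n_steps + 1 - i):
            k = n_steps - i - j  # budget constraint: the weights sum to 1
            w = np.array([i, j, k], dtype=float) / n_steps
            if w @ cov @ w > max_variance + 1e-12:
                continue  # risk constraint violated
            portfolio_return = w @ expected_returns
            if portfolio_return > best_return:
                best_weights, best_return = w, portfolio_return
    return best_weights, best_return

R = [1.0, 2.0, 3.0]               # hypothetical expected returns
Sigma = np.diag([1.0, 4.0, 9.0])  # hypothetical, uncorrelated assets

loose_w, loose_r = grid_search_portfolio(R, Sigma, max_variance=1e9)
tight_w, tight_r = grid_search_portfolio(R, Sigma, max_variance=1.0)
print(loose_w, loose_r)
print(tight_w, tight_r)
```

When the risk constraint is loose, the search puts the whole budget on the asset with the highest expected return; tightening `max_variance` forces the weights to spread across assets. For more than a handful of assets a dedicated solver would be needed, since the grid grows exponentially.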



In [2]:
import json
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio  # needed for pio.renderers below
import seaborn as sns
from plotly.offline import init_notebook_mode
from sqlalchemy import create_engine

pio.renderers.default = 'iframe'

init_notebook_mode(connected=True)

def load_credentials(path="aws_rds_credentials.json"):
    """Load database credentials from a JSON file into environment variables."""
    with open(path, 'r') as file:
        config = json.load(file)

    # environment variables must be strings (the port may be stored as a number)
    for key, value in config.items():
        os.environ[key] = str(value)

time_interval = 30  # days

load_credentials()

aws_rds_url = f"postgresql://{os.environ['user']}:{os.environ['password']}@{os.environ['host']}:{os.environ['port']}/{os.environ['database']}?sslmode=require"

engine = create_engine(aws_rds_url)
# Daily median price per catalog, restricted to catalogs observed on more than 15 distinct days
sql_query = f"""WITH catalogs AS (
                    SELECT catalog_id
                    FROM public.tracking_staging
                    WHERE date >= CURRENT_DATE - INTERVAL '{time_interval} days'
                    GROUP BY catalog_id
                    HAVING COUNT(DISTINCT date) > 15
               )
               SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY price_numeric) as price, catalog_id, date
               FROM public.tracking_staging
               WHERE date >= CURRENT_DATE - INTERVAL '{time_interval} days'
                    AND catalog_id IN (SELECT catalog_id FROM catalogs)
               GROUP BY date, catalog_id
               """
data = pd.read_sql(sql_query, engine)
data

The query above loads the long-format (melted) dataframe into memory.

Notice I used a CTE to select only catalogs with meaningful representation (observed on more than a minimum number of days).

These are the products we are going to analyze. Each product has a different selling price and volatility, as prices and quantities sold can vary across time. For this first analysis, the proxies for return and risk are each product's expected price and the standard deviation of its price.

In [None]:
data = data.pivot_table(index = "date", columns="catalog_id", values = "price")
data.head(10)


Notice that some products have no data for several dates. We fill those gaps with each product's median price, which leaves each product's median unchanged and limits the distortion of the estimated moments.
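The effect of this choice can be seen on a toy price series. The numbers below are made up for illustration: filling gaps with the median keeps the median where it was, but it pulls the sample standard deviation down, which is worth keeping in mind when reading the volatility figures later on.

```python
import numpy as np

# Hypothetical daily prices for one product, with two missing days.
prices = np.array([10.0, np.nan, 14.0, np.nan, 12.0, 20.0])

median_observed = np.nanmedian(prices)  # median of the observed days only
imputed = np.where(np.isnan(prices), median_observed, prices)

std_observed = np.nanstd(prices, ddof=1)  # sample std, ignoring gaps
std_imputed = np.std(imputed, ddof=1)     # sample std after filling gaps

print("median before/after:", median_observed, np.median(imputed))
print("std before/after:", round(std_observed, 3), round(std_imputed, 3))
```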

In [None]:
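from sklearn.impute import SimpleImputer

# Fill each product's missing days with that product's median price
imputer = SimpleImputer(strategy='median')

for col in data.columns:
    # fit_transform returns a 2-D array of shape (n, 1); flatten it back to 1-D
    data[col] = imputer.fit_transform(data[col].values.reshape(-1, 1)).ravel()

plt.figure(figsize=(12, 8))
plt.plot(data, alpha=.4)
plt.xlabel('date')
plt.ylabel('median price')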


Let's create several sample portfolios using the MPT formulas and a generator function.
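Before sampling, it may help to confirm what the variance formula means when weights are 0 or 1. The sketch below uses synthetic prices and a hypothetical basket: it checks that the double sum of covariances, the quadratic form $\mathbf{w}^T \mathbf{\Sigma} \mathbf{w}$, and the sample variance of the basket's total price all agree.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 30 days of prices for 4 products (not the Vinted dataset).
prices = rng.normal(loc=20.0, scale=5.0, size=(30, 4))
cov = np.cov(prices, rowvar=False)  # sample covariance, ddof=1 by default

# Hypothetical basket: products 0, 2 and 3 are included, product 1 is not.
weights = np.array([1, 0, 1, 1])

# 1) Explicit double sum over all pairs of assets.
double_sum = sum(weights[i] * weights[j] * cov[i, j]
                 for i in range(4) for j in range(4))
# 2) The same quantity written as a quadratic form.
quadratic_form = weights @ cov @ weights
# 3) Sample variance of the basket's total daily price.
basket_total = prices[:, weights.astype(bool)].sum(axis=1)
basket_variance = np.var(basket_total, ddof=1)

print(double_sum, quadratic_form, basket_variance)
```

With binary weights, portfolio volatility is therefore the standard deviation of the total price of the selected products.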

In [None]:
num_port = 5000

port_weights = []
port_returns = []
port_volatility = []

# Generator function: yields one random portfolio at a time,
# which scales better as num_port grows
def generate_random_portfolios(num_port):
    num_assets = len(data.columns)
    # computed once, since they do not change between portfolios
    var_matrix = data.cov().values
    expected_prices = data.median().values

    for _ in range(num_port):
        # each asset is either excluded (0) or included (1)
        weights = np.random.randint(0, 2, size=num_assets)
        # portfolio return: dot product of the weights and the expected (median) prices
        returns = np.dot(weights, expected_prices)
        # portfolio variance: double sum of covariances between assets, i.e. w^T Σ w
        var = weights @ var_matrix @ weights
        std = np.sqrt(var)

        yield weights, returns, std

for weights, returns, volatility in generate_random_portfolios(num_port):
    port_weights.append(weights)
    port_returns.append(returns)
    port_volatility.append(volatility)


In [None]:
new_data = {"Revenue": port_returns, 
            "Volatility": port_volatility}

for counter, symbol in enumerate(data.columns.tolist()):
    new_data[str(symbol)+'_weight'] = [w[counter] for w in port_weights]

portfolio = pd.DataFrame(new_data)
portfolio.head(10)


In [None]:
# Heatmap of the first 50 sample portfolios (rows) against products (columns)
weights_sample = portfolio.drop(columns=["Revenue", "Volatility"]).head(50)

fig = go.Figure(data=go.Heatmap(
    z=weights_sample.values,
    x=weights_sample.columns,
    y=weights_sample.index,
    colorscale='YlGnBu',
    colorbar=dict(title='Included (0/1)')
))

fig.update_layout(
    title='Sparse matrix (product distribution across portfolios)',
    xaxis_title='Catalog_id',
    yaxis_title='Sample portfolio index'
)
fig



In [None]:
# Sharpe-like ratio: expected revenue per unit of volatility (no risk-free rate is subtracted)
portfolio["Sharpe"] = portfolio["Revenue"]/portfolio["Volatility"]

fig = px.scatter(
    data_frame=portfolio,
    x='Volatility',
    y='Revenue',
    color='Sharpe',
    title='Expected Returns vs. Volatility of Sample Portfolios',
    labels={'Volatility': 'Volatility (std)', 'Revenue': 'Expected Returns', 'Sharpe': 'Sharpe Ratio'},
    marginal_x='histogram',
    marginal_y='histogram',
)

fig.update_layout(
    width=1200,  
    height=800,  
)

fig
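One edge case in the ratio above: a portfolio that happens to contain no products has zero revenue and zero volatility, so the plain division yields an undefined value. A minimal sketch of a guarded version follows; the helper name `sharpe_like_ratio` and the sample numbers are assumptions for this example.

```python
import numpy as np

def sharpe_like_ratio(revenue, volatility):
    """Revenue per unit of volatility; NaN where volatility is not positive."""
    revenue = np.asarray(revenue, dtype=float)
    volatility = np.asarray(volatility, dtype=float)
    ratio = np.full(revenue.shape, np.nan)
    # Divide only where volatility > 0; other positions keep their NaN.
    np.divide(revenue, volatility, out=ratio, where=volatility > 0)
    return ratio

# Hypothetical portfolios; the last one is empty.
ratios = sharpe_like_ratio([100.0, 50.0, 0.0], [20.0, 5.0, 0.0])
print(ratios)
```

Rows with a NaN ratio can then be dropped or ignored before ranking portfolios.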



In [None]:
# Histogram of the Sharpe ratio across all sample portfolios
histogram_trace = go.Histogram(
    x=portfolio['Sharpe'],
    marker_color='skyblue',
    opacity=0.7,
)

layout = go.Layout(
    title='Distribution of Sharpe Ratios',
    xaxis=dict(title='Sharpe'),
    yaxis=dict(title='Frequency'),
)

fig = go.Figure(data=[histogram_trace], layout=layout)
fig


In [None]:
# Number of articles in each portfolio (sum of the 0/1 weights)
z = portfolio.drop(["Volatility", "Revenue", "Sharpe"], axis=1).sum(axis=1)

fig = px.scatter(
    data_frame=portfolio,
    x=z,
    y='Revenue',
    color='Sharpe',
    title='Revenue per Number of Articles',
    labels={'Revenue': 'Expected Returns', 'Sharpe': 'Sharpe Ratio'}
)

fig.update_layout(
    width=1200,
    height=800,
    xaxis_title='Number of articles',
)

fig


In [None]:
scatter3d_trace = go.Scatter3d(
    x=portfolio["Volatility"],
    y=portfolio["Revenue"],
    z=portfolio["Sharpe"],
    mode='markers',
    marker=dict(
        size=4,                    
        color=portfolio["Sharpe"],                   
        colorscale='Viridis',      
        opacity=0.8,
        line=dict(width=0.5, color='black') 
    )
)

layout = go.Layout(
    title='Overview of Sharpe, Volatility and Revenue',
    width=1200,
    height=800,
    scene=dict(
        xaxis=dict(title='Volatility €', tickfont=dict(size=10)),  # tick label font size
        yaxis=dict(title='Revenue €', tickfont=dict(size=10)),
        zaxis=dict(title='Sharpe', tickfont=dict(size=10)),
        camera=dict(eye=dict(x=1.2, y=1.2, z=1.2))  # camera position
    )
)

fig = go.Figure(data=[scatter3d_trace], layout=layout)
fig


## Analysis of results

In [None]:
top_5_port = portfolio.sort_values("Sharpe", ascending=False).reset_index(drop= True).head(5)
top_5_port


In [None]:
# Heatmap of the products included in each of the top 5 portfolios
top_5_weights = top_5_port.drop(columns=["Revenue", "Volatility", "Sharpe"])

fig = go.Figure(data=go.Heatmap(
    z=top_5_weights.values,
    x=top_5_weights.columns,
    y=top_5_weights.index,
    colorscale='YlGnBu',
    colorbar=dict(title='Included (0/1)')
))

fig.update_layout(
    title='Top 5 portfolio composition',
    xaxis_title='Catalog_id',
    yaxis_title='Rank by Sharpe ratio (0 = best)'
)
fig


In [None]:
statistics = portfolio[["Revenue", "Volatility", "Sharpe"]].describe()

# Style the DataFrame
styled_statistics = statistics.style \
    .format("{:.2f}") \
    .set_caption("Statistics for Portfolio") \
    .set_table_styles([{'selector': 'caption', 'props': [('color', 'red'), ('font-size', '16px')]}]) 

styled_statistics

