# Create `asset-data.json` file for Mean-Variance Analyzer

In this notebook, we get and clean the financial market data that will be preloaded as `asset-data.json` in the "Mean-Variance Analyzer" web app ([view live site](https://mvanalyzer.dev/)).

Last run: 4/27/2023

In [1]:
import pandas as pd
import yfinance as yf
import json

## Getting monthly adjusted close % return for assets

We start by listing the tickers whose financial market data we will fetch via the yfinance API, and we sort them in alphabetical order. Sorting lets us store each covariance only once, under the alphabetically earlier ticker of the pair. Because the covariance matrix is symmetric, this is enough to recreate it on the client's device from a file approximately half the size (please see the site's [Background](https://mvanalyzer.dev/background/) page for more information on the covariance matrix and how other quantities contained in this dataset are used in the app). We have chosen 189 popular assets for our demo data, including stocks, ETFs, cryptocurrencies and more.

In [2]:
assetTickers = ['005930.KS', '2222.SR', 'AAPL', 'ABBV', 'ABNB', 'ABT', 'ADA-USD', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AEP', 
                'AJRD', 'AMAGX', 'AMANX', 'AMAT', 'AMD', 'AMGN', 'AMZN', 'APO', 'AR', 'ASML', 'ASTS', 'ATVI', 'AVGO', 
                'AXP', 'B', 'BA', 'BABA', 'BAC', 'BHVN', 'BJ', 'BKNG', 'BLK', 'BMY', 'BNB-USD', 'BNP.PA', 'BRK-B', 
                'BTC-USD', 'BX', 'C', 'CAR', 'CAT', 'CDNS', 'CHK', 'CHTR', 'CL=F', 'CMCSA', 'COIN', 'COST', 'CRM', 
                'CSCO', 'CSX', 'CTAS', 'CVS', 'CVX', 'DBX', 'DELL', 'DIS', 'DOGE-USD', 'DOT-USD', 'DOW', 'DXCM', 
                'EBAY', 'ETC-USD', 'ETH-USD', 'EURUSD=X', 'FIS', 'FISV', 'FTNT', 'GBPUSD=X', 'GC=F', 'GD', 'GE', 
                'GILD', 'GLD', 'GOOG', 'GOOGL', 'GS', 'GSAT', 'HD', 'HLAL', 'HON', 'HOOD', 'HPQ', 'HWM', 'IBM', 'IBN', 
                'INTC', 'INTU', 'IVZ', 'JD', 'JNJ', 'JOBY', 'JPM', 'JPY=X', 'KAMN', 'KKR', 'KO', 'LLY', 
                'LTC-USD', 'LYFT', 'MA', 'MATIC-USD', 'MC.PA', 'MCD', 'META', 'MMM', 'MRK', 'MS', 'MSFT', 'MUFG', 
                'NESN.SW', 'NFLX', 'NKE', 'NOC', 'NOK', 'NU', 'NVDA', 'NVO', 'ORCL', 'OVV', 'PANW', 'PDD', 'PEP', 
                'PFE', 'PFGC', 'PG', 'PL', 'PNC', 'PYPL', 'QCOM', 'RKLB', 'ROG.SW', 'RTX', 'SAP', 'SCHW', 'SCU', 
                'SHOP', 'SI=F', 'SOFI', 'SOL-USD', 'SONY', 'SPCE', 'SPGI', 'SPR', 'SPUS', 'SPY', 'SQ', 'SWN', 'TCEHY', 
                'TDG', 'TEAM', 'TGI', 'TMO', 'TRV', 'TRX-USD', 'TSLA', 'TSM', 'TTEK', 'TXN', 'UBER', 'UNH', 'USB', 
                'V', 'VEA', 'VMW', 'VORB', 'VTI', 'VTOL', 'VZ', 'WBA', 'WFC', 'WIZEY', 'WMT', 'WSC', 'XLM-USD', 
                'XMR-USD', 'XOM', 'XRP-USD', '^CMC200', '^DJI', '^FTSE', '^GSPC', '^IXIC', '^N225', '^RUT', '^TNX', 
                '^TYX']

assetTickers.sort()

In [3]:
print(len(assetTickers)) # Check that it contains 189 assets
print(len(set(assetTickers))) # Check that it contains 189 unique assets

189
189
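To make the "approximately half the size" claim concrete, here is a small arithmetic sketch. It assumes the layout used later in this notebook: one variance per asset plus one covariance per unordered pair of assets.

```python
# Count how many numbers are needed to describe an n x n covariance matrix.
n = 189  # number of assets in this notebook

full_entries = n * n                   # every cell of the full matrix
variances = n                          # the diagonal, stored once per asset
upper_covariances = n * (n - 1) // 2   # pairs (i, j) with i < j in sorted order
stored_entries = variances + upper_covariances

print(full_entries, stored_entries, round(stored_entries / full_entries, 4))
# prints: 35721 17955 0.5026
```

The ratio counts numeric entries only. The actual file size also depends on the ticker keys and titles stored alongside the numbers.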


Now that we have the tickers for our assets of interest, let us make a pandas DataFrame of the monthly % return of the adjusted close price over the max period of each asset's data as provided by the yfinance API. We clean the data by first forward filling all NaN values for close prices and then dropping any dates on which none of the assets has data yet (assets that start trading later keep NaN values until their first month of data). We also drop the most recent month, since it has not finished yet. This is one of several choices in this document made by the developer that will inevitably affect the accuracy of the results. Note that we are not liable for the accuracy of this data nor its resulting information as per the site's [Terms of Service](https://mvanalyzer.dev/terms/). This data also only runs up to the month before this code was run and will be stale after that. It is only meant for educational demonstration and not as financial advice.

In [4]:
%%time
df = pd.DataFrame()
for ticker in assetTickers:
    # Get monthly max period close data
    tmpDf = pd.DataFrame(yf.Ticker(ticker).history(period="max", interval="1mo")["Close"]).rename(
        columns={"Close":ticker})
    
    # Format for monthly index using the last datum of each month
    tmpDf.index = tmpDf.index.strftime('%Y-%m')
    tmpDf = tmpDf[~tmpDf.index.duplicated(keep='last')]
    
    df = df.join(tmpDf, how='outer')

# clean data
df.ffill(inplace=True) # forward fill gaps (fillna(method='ffill') is deprecated in recent pandas)
df.dropna(how='all', inplace=True)
df.drop(df.index[-1], inplace=True) # delete data from last month that hasn't finished yet

# get monthly pct return
df = df.pct_change()[1:] * 100

Wall time: 49.6 s
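The effect of the cleaning steps above is easier to see on a tiny example. The sketch below uses made-up prices for two hypothetical assets, `A` and `B`, and applies the same sequence of operations. Passing `fill_method=None` to `pct_change` is a choice made here to avoid any implicit filling. It is not what the cell above does, although the two should agree once the data has already been forward filled.

```python
import numpy as np
import pandas as pd

# Hypothetical month-end prices: "B" starts trading later than "A",
# and "A" has one missing month in the middle.
prices = pd.DataFrame(
    {
        "A": [np.nan, 100.0, 110.0, np.nan, 121.0, 133.1],
        "B": [np.nan, np.nan, np.nan, 50.0, 55.0, 60.5],
    },
    index=["2022-09", "2022-10", "2022-11", "2022-12", "2023-01", "2023-02"],
)

cleaned = prices.ffill()                    # carry the last known price forward
cleaned = cleaned.dropna(how="all")         # drop months where no asset has a price
cleaned = cleaned.drop(cleaned.index[-1])   # drop the final (unfinished) month

# Monthly % return; the first row has no previous month, so it is removed.
returns = cleaned.pct_change(fill_method=None).iloc[1:] * 100
print(returns)
```

In this toy data, the forward-filled month for `A` shows up as a 0% return, and `B` stays NaN until it has two consecutive months of prices. Both are direct consequences of the cleaning choices described above.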


In [27]:
assets_with_oldest_data = df.columns[~df.iloc[0].isnull()]

df[assets_with_oldest_data[0]].head()

Date
1962-03     3.015058
1962-04   -10.322224
1962-05    -4.014541
1962-06    -9.505765
1962-07    12.421094
Name: GE, dtype: float64

In [28]:
df.iloc[-5:]

Date,005930.KS,2222.SR,AAPL,ABBV,ABNB,ABT,ADA-USD,ADBE,ADI,ADP,...,XRP-USD,^CMC200,^DJI,^FTSE,^GSPC,^IXIC,^N225,^RUT,^TNX,^TYX
2022-11,4.713803,-4.460431,-3.462901,11.215906,-4.461701,9.245602,-21.522338,8.298271,20.537098,9.28424,...,-12.270908,-15.717265,5.672634,6.746072,5.375289,4.366973,1.382981,2.150676,-9.173413,-9.086583
2022-12,-11.093244,-2.419092,-12.081646,0.266786,-16.291364,2.054288,-22.709585,-2.435275,-4.5838,-9.570681,...,-16.835513,-6.312361,-4.170365,-1.603041,-5.897147,-8.733166,-6.70203,-6.643236,4.7529,4.003136
2023-01,11.724934,2.647985,11.052102,-8.576205,29.953217,0.692233,58.612551,10.046652,5.001122,-5.005379,...,19.551136,37.168548,2.832178,4.294322,6.175286,10.682381,4.723637,9.691409,-9.022942,-7.899369
2023-02,-1.302932,-3.490147,2.162316,5.152609,10.953108,-7.570733,-9.841248,-12.526323,6.998309,-2.652666,...,-7.19483,0.378965,-4.193329,1.345904,-2.611249,-1.113636,0.433457,-1.80906,10.966275,7.375034
2023-03,5.610561,1.572328,12.035662,3.554254,0.908503,-0.45222,13.184416,18.95971,7.994684,1.278311,...,42.779272,18.079322,1.890728,-3.105514,3.505161,6.689952,2.17128,-4.982104,-10.776301,-6.156197


In [29]:
df.iloc[-1].isna().sum()

0

## Extracting the mean-variance analysis data

For the basic mean-variance analysis used by the app, we want to extract from the data each ticker's long name, annualized mean monthly % return, the variance of these returns, and the covariances of these returns with respect to those of each of the other assets. To calculate the covariance between any two of the assets, the developer chose to first find the earliest date on which both assets have valid (non-NaN) data and then calculate the covariance only over the periods where both assets have data, with the intention that this may better capture the diversification that mean-variance analysis aims for (the variance and average of each asset, by contrast, are calculated over its own max period). Note that this affects the accuracy of the data and may not be part of the standard calculation of the ex post Sharpe ratio.

In [30]:
firstValidMonthDict = dict()
for ticker in assetTickers:
    firstValidMonthDict[ticker] = df[ticker].notna().idxmax()
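As an illustration of the overlapping-period idea, the sketch below computes an annualized covariance for two hypothetical assets, `OLD` and `NEW`, using made-up monthly returns. It follows the same steps described above: find each asset's first valid month, take the later of the two, and compute the covariance from that month onward.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly % returns; "NEW" only has data for the last four months.
returns = pd.DataFrame(
    {
        "OLD": [1.0, -2.0, 3.0, 0.5, -1.5, 2.5],
        "NEW": [np.nan, np.nan, 4.0, -1.0, 0.0, 3.0],
    },
    index=["2022-10", "2022-11", "2022-12", "2023-01", "2023-02", "2023-03"],
)

# First month with a valid return for each asset.
first_valid = {col: returns[col].notna().idxmax() for col in returns.columns}

# "YYYY-MM" strings sort chronologically, so max() picks the later start.
start = max(first_valid["OLD"], first_valid["NEW"])

# Monthly covariance over the shared window, then scaled by 12 months/year.
monthly_cov = returns.loc[start:, ["OLD", "NEW"]].cov().iloc[0, 1]
annual_cov = 12 * monthly_cov

print(start, annual_cov)
```

In this toy case `returns.cov()` on the whole frame gives the same off-diagonal value, because pandas excludes missing values pair by pair. That shortcut relies on the only missing values being the leading ones, which is an assumption about the cleaned data rather than something checked here.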

Now we use the DataFrame of historical monthly returns to create a dictionary containing, for each asset, the annualized mean return and variance over its max period, plus its covariance with every asset that comes later in alphabetical order, calculated over the periods where both have valid data.

In [31]:
%%time
assetDataDict = dict()
for i in range(len(assetTickers)-1): # Calculate covariances for all except last asset (alphabetical order)
    assetVar = 12 * (df[assetTickers[i]].std()**2) # Annualized %: multiply by 12 months/year
    assetMean = 12 * df[assetTickers[i]].mean()
    
    assetInfo = yf.Ticker(assetTickers[i]).info
    if 'longName' in assetInfo and assetInfo['longName']:
        assetTitle = assetInfo['longName']
    else:
        assetTitle = assetInfo['shortName']
        
    assetDataDict[assetTickers[i]] = {'title': assetTitle, 'annRetPct': assetMean, 
                                      'annVar': assetVar, 'cov': {}}
    
    for j in range(i+1, len(assetTickers)):
        startDate = max(firstValidMonthDict[assetTickers[i]], firstValidMonthDict[assetTickers[j]])
        assetCov = df[[assetTickers[i], assetTickers[j]]].loc[startDate:].cov().iloc[0, 1]
        assetDataDict[assetTickers[i]]['cov'][assetTickers[j]] = 12 * assetCov

# Calculate values for last asset (alphabetical order)
assetVar = 12 * (df[assetTickers[-1]].std()**2)
assetMean = 12 * df[assetTickers[-1]].mean()

assetInfo = yf.Ticker(assetTickers[-1]).info
if 'longName' in assetInfo and assetInfo['longName']:
    assetTitle = assetInfo['longName']
else:
    assetTitle = assetInfo['shortName']

assetDataDict[assetTickers[-1]] = {'title': assetTitle, 'annRetPct': assetMean, 
                                  'annVar': assetVar}

Wall time: 1min 23s


In [32]:
print(assetDataDict['^TNX'])
print(assetDataDict['^TYX'])

{'title': 'Treasury Yield 10 Years', 'annRetPct': 1.010207723321614, 'annVar': 811.0913491705217, 'cov': {'^TYX': 552.9169091166535}}
{'title': 'Treasury Yield 30 Years', 'annRetPct': -0.768824488479595, 'annVar': 428.88240956769164}
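To show how the half-stored form can be expanded again, here is a minimal sketch that rebuilds a full symmetric covariance matrix from a dictionary shaped like the output above. The helper name `build_cov_matrix` is hypothetical, the two entries are copied from that output with values rounded, and this is not necessarily how the web app itself performs the reconstruction.

```python
import numpy as np

# Two entries shaped like the output above (values rounded for brevity).
asset_data = {
    "^TNX": {"annVar": 811.0913, "cov": {"^TYX": 552.9169}},
    "^TYX": {"annVar": 428.8824},
}

def build_cov_matrix(data, tickers):
    """Rebuild a full symmetric covariance matrix from the half-stored form."""
    tickers = sorted(tickers)  # stored covariances assume this ordering
    n = len(tickers)
    matrix = np.zeros((n, n))
    for i, a in enumerate(tickers):
        matrix[i, i] = data[a]["annVar"]
        for j in range(i + 1, n):
            b = tickers[j]
            # Each pair is stored only under the alphabetically earlier ticker.
            matrix[i, j] = matrix[j, i] = data[a]["cov"][b]
    return tickers, matrix

tickers, cov = build_cov_matrix(asset_data, ["^TYX", "^TNX"])
print(tickers)
print(cov)
```

Because the helper sorts its input the same way the notebook sorts `assetTickers`, every lookup goes from the earlier ticker to the later one and never touches the missing `cov` key of the last asset.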


## Export dictionary as JSON file

Now we can export the dictionary with the data required by the [Mean-Variance Analyzer](https://mvanalyzer.dev/) web app as a JSON file to its data folder.

In [35]:
with open("../mean-variance-analyzer/data/asset-data.json", "w") as f:
    json.dump(assetDataDict, f, indent=2)
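One optional safeguard before exporting, sketched below under the assumption that a NaN could end up in the dictionary (for example, if a ticker returned too little data). By default Python's `json` module writes such values as the bare token `NaN`, which is not valid JSON and may be rejected by strict parsers. Passing `allow_nan=False` makes serialization raise a `ValueError` instead. The helper name `to_strict_json` and the sample values are made up for illustration.

```python
import json
import math

def to_strict_json(data):
    """Serialize data, raising ValueError if it contains NaN or infinity."""
    return json.dumps(data, indent=2, allow_nan=False)

ok = {"SPY": {"annRetPct": 10.0, "annVar": 250.0}}       # made-up numbers
bad = {"XYZ": {"annRetPct": math.nan, "annVar": 250.0}}  # made-up numbers

# A clean dictionary survives a round trip unchanged.
print(json.loads(to_strict_json(ok)) == ok)

# A dictionary containing NaN is rejected instead of being written out.
try:
    to_strict_json(bad)
except ValueError as err:
    print("rejected:", err)
```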