# Create `asset-data.json` for Mean-Variance Analyzer

In this notebook, we get and clean the financial market data that will be preloaded as a JSON file in the "Mean-Variance Analyzer" web app ([view live site](https://meanvarianceanalyzermain.gatsbyjs.io/)).

In [1]:
import pandas as pd
import yfinance as yf
import json

## Getting the monthly close price return data

We start by defining the tickers whose financial market data we will get via the yfinance API, and we sort them in alphabetical order. Sorting lets us store each pairwise covariance only once (under the alphabetically earlier ticker), so the file needed to recreate the covariance matrix for these assets on the client's device is approximately half the size, since the matrix is symmetric. Please see the site's [Tutorial](https://meanvarianceanalyzermain.gatsbyjs.io/tutorial) page for more information on the covariance matrix and the other quantities contained in this dataset. We have chosen 101 popular assets for our demo data, including stocks, ETFs, cryptocurrencies, and more.

In [77]:
assetTickers = ['AAPL', 'MSFT', 'UNH', 'JNJ', 'V', 'WMT', 'JPM', 'CVX', 'PG', 'HD',
               'KO', 'MRK', 'MCD', 'DIS', 'CSCO', 'VZ', 'CRM', 'AMGN', 'NKE', 'HON',
               'IBM', 'GS', 'CAT', 'INTC', 'AXP', 'BA', 'MMM', 'TRV', 'DOW', 'WBA',
               'GOOGL', 'GOOG', 'AMZN', 'TSLA', 'BRK-B', 'META', 'NVDA', 'XOM', 'MA', 'LLY',
               'PFE', 'BAC', 'ABBV', 'PEP', 'COST', 'AVGO', 'TMO', 'ABT', 'ADBE', 'CMCSA',
               'INTU', 'ADP', 'GILD', 'AMD', 'AMAT', 'BKNG', 'ADI', 'ABNB', 'FISV', 'CSX',
               'CHTR', 'ATVI', 'DXCM', 'ADSK', 'AEP', 'FTNT', 'CTAS', 'CDNS', 'ORCL', 'SPY',
               'VTI', 'VEA', 'GLD', 'HLAL', 'SPUS', 'AMAGX', 'AMANX', 'BTC-USD', 'ETH-USD', 'BNB-USD',
               'XRP-USD', 'DOGE-USD', 'ADA-USD', 'SOL-USD', 'MATIC-USD', 'DOT-USD', 'TRX-USD', 'LTC-USD', 'ETC-USD', 'XLM-USD',
               'XMR-USD', '2222.SR', 'TSM', 'MC.PA', 'TCEHY', 'NESN.SW', '005930.KS', '600519.SS', 'ROG.SW', 'NVO',
               'ASML']

assetTickers.sort()

In [78]:
print(assetTickers)
print(len(assetTickers)) # Check that it contains 101 assets
print(len(set(assetTickers))) # Check that it contains 101 unique assets

['005930.KS', '2222.SR', '600519.SS', 'AAPL', 'ABBV', 'ABNB', 'ABT', 'ADA-USD', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AEP', 'AMAGX', 'AMANX', 'AMAT', 'AMD', 'AMGN', 'AMZN', 'ASML', 'ATVI', 'AVGO', 'AXP', 'BA', 'BAC', 'BKNG', 'BNB-USD', 'BRK-B', 'BTC-USD', 'CAT', 'CDNS', 'CHTR', 'CMCSA', 'COST', 'CRM', 'CSCO', 'CSX', 'CTAS', 'CVX', 'DIS', 'DOGE-USD', 'DOT-USD', 'DOW', 'DXCM', 'ETC-USD', 'ETH-USD', 'FISV', 'FTNT', 'GILD', 'GLD', 'GOOG', 'GOOGL', 'GS', 'HD', 'HLAL', 'HON', 'IBM', 'INTC', 'INTU', 'JNJ', 'JPM', 'KO', 'LLY', 'LTC-USD', 'MA', 'MATIC-USD', 'MC.PA', 'MCD', 'META', 'MMM', 'MRK', 'MSFT', 'NESN.SW', 'NKE', 'NVDA', 'NVO', 'ORCL', 'PEP', 'PFE', 'PG', 'ROG.SW', 'SOL-USD', 'SPUS', 'SPY', 'TCEHY', 'TMO', 'TRV', 'TRX-USD', 'TSLA', 'TSM', 'UNH', 'V', 'VEA', 'VTI', 'VZ', 'WBA', 'WMT', 'XLM-USD', 'XMR-USD', 'XOM', 'XRP-USD']
101
101
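
The storage saving from alphabetical ordering comes from symmetry: each asset only needs to store its covariances with the tickers that sort after it. Below is a minimal sketch of how a client could rebuild the full matrix from such a half-stored structure. The tickers, the values, and the helper `rebuild_cov_matrix` are hypothetical; only the `annVar` and `cov` keys follow the layout built later in this notebook.

```python
import numpy as np

# Hypothetical, made-up values for three tickers in alphabetical order.
half_stored = {
    "AAA": {"annVar": 4.0, "cov": {"BBB": 1.0, "CCC": 0.5}},
    "BBB": {"annVar": 9.0, "cov": {"CCC": 2.0}},
    "CCC": {"annVar": 16.0},  # last ticker: nothing sorts after it, so no "cov"
}

def rebuild_cov_matrix(data):
    """Rebuild the full symmetric covariance matrix from the stored upper triangle."""
    tickers = sorted(data)
    n = len(tickers)
    cov = np.zeros((n, n))
    for i, ti in enumerate(tickers):
        cov[i, i] = data[ti]["annVar"]  # variances sit on the diagonal
        for j in range(i + 1, n):
            # Each off-diagonal value is stored once and mirrored here.
            cov[i, j] = cov[j, i] = data[ti]["cov"][tickers[j]]
    return tickers, cov

tickers, cov = rebuild_cov_matrix(half_stored)

# Count stored numbers (variance + later covariances) versus full matrix entries.
stored = sum(1 + len(v.get("cov", {})) for v in half_stored.values())
print(stored, cov.size)
```

For n assets this stores n(n+1)/2 numbers instead of n², which is 5,151 instead of 10,201 for the 101 assets here.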


Now that we have the tickers for our assets of interest, let us make a pandas DataFrame of the monthly returns (in %) of the close price over the maximum period of each asset's data as provided by the yfinance API. We clean the data by first forward filling all NaN values for close prices and then dropping any dates on which none of the assets has data (`how='all'`); assets that start later keep NaN values for the months before their first price. This is one of several choices made by the developer in this document that will inevitably affect the accuracy of the results. Note that we are not liable for the accuracy of this data nor its resulting information as per the site's [Terms of Service](https://meanvarianceanalyzermain.gatsbyjs.io/terms). This data also only runs up to November 2022 and will be stale after that; it is meant only for educational demonstration and not as financial advice.

In [79]:
%%time
df = pd.DataFrame()
for ticker in assetTickers:
    # Get monthly max period close data
    tmpDf = pd.DataFrame(yf.Ticker(ticker).history(period="max", interval="1mo")["Close"]).rename(
        columns={"Close":ticker})
    
    # Format for monthly index using the first datum of each month
    tmpDf.index = tmpDf.index.strftime('%Y-%m')
    tmpDf = tmpDf[~tmpDf.index.duplicated(keep='first')]
    
    df = df.join(tmpDf, how='outer')

# clean data
df.ffill(inplace=True)  # forward fill; fillna(method='ffill') is deprecated in recent pandas
df.dropna(how='all', inplace=True)

# get monthly pct return
df = df.pct_change()[1:] * 100

Wall time: 30 s


In [80]:
df.head()

Date,005930.KS,2222.SR,600519.SS,AAPL,ABBV,ABNB,ABT,ADA-USD,ADBE,ADI,...,V,VEA,VTI,VZ,WBA,WMT,XLM-USD,XMR-USD,XOM,XRP-USD
1962-03,,,,,,,,,,,...,,,,,,,,,,
1962-04,,,,,,,,,,,...,,,,,,,,,,
1962-05,,,,,,,,,,,...,,,,,,,,,,
1962-06,,,,,,,,,,,...,,,,,,,,,,
1962-07,,,,,,,,,,,...,,,,,,,,,,


In [81]:
df.iloc[:5]['IBM']

Date
1962-03    -0.777853
1962-04   -14.781827
1962-05   -13.546219
1962-06   -13.553293
1962-07    14.075288
Name: IBM, dtype: float64

In [82]:
df.iloc[-5:]

Date,005930.KS,2222.SR,600519.SS,AAPL,ABBV,ABNB,ABT,ADA-USD,ADBE,ADI,...,V,VEA,VTI,VZ,WBA,WMT,XLM-USD,XMR-USD,XOM,XRP-USD
2022-07,8.377966,2.448456,-6.164839,18.863362,-6.300597,24.584644,0.174873,12.677585,12.036278,17.708278,...,7.730206,6.679218,9.783913,-8.985216,4.538254,8.611614,5.212997,37.319939,13.18309,14.786636
2022-08,-2.768735,-5.660373,1.346372,-3.255183,-5.429781,1.928275,-5.268042,-13.607791,-8.943722,-11.880675,...,-6.317472,-5.819367,-3.728483,-8.343764,-11.509335,0.378646,-11.470707,-3.75713,-1.382439,-13.822284
2022-09,-11.055271,-3.613675,-2.676715,-11.9756,-0.185929,-7.142859,-5.737941,-2.742769,-26.306767,-7.602045,...,-10.440022,-10.133454,-9.61422,-9.184402,-9.374895,-1.723977,9.738578,-1.154516,-7.774532,46.330552
2022-10,12.633033,-3.068329,-27.903872,10.95514,9.082771,1.780277,2.252996,-6.463417,15.734006,2.353954,...,16.611322,6.40085,8.571764,-1.580199,16.242042,9.737861,-2.634779,1.248424,26.915581,-2.888653
2022-11,2.693603,-3.884886,14.740741,-1.480357,9.980779,-9.61557,7.661449,-23.235449,5.425431,18.636939,...,2.20602,11.278192,3.793228,5.902899,14.410945,7.089156,-21.907982,-10.190525,2.526857,-19.285493


In [83]:
df.iloc[-1].isna().sum()

0
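
To make the effect of these cleaning choices concrete, here is a minimal sketch of the same three steps (forward fill, drop all-NaN dates, percentage change) on a made-up two-asset price table. The tickers `OLD` and `NEW` and all prices are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up monthly close prices; NaN marks months with no data.
prices = pd.DataFrame(
    {
        "OLD": [np.nan, 100.0, 110.0, np.nan, 121.0],
        "NEW": [np.nan, np.nan, np.nan, 50.0, 55.0],
    },
    index=["2000-01", "2000-02", "2000-03", "2000-04", "2000-05"],
)

cleaned = prices.ffill()              # carry the last known price forward
cleaned = cleaned.dropna(how="all")   # drop months where no asset has a price yet

# Monthly return in %; the first remaining row has no previous month, so drop it.
returns = cleaned.pct_change(fill_method=None)[1:] * 100
print(returns)
```

Two things are worth noticing in the result: the forward-filled gap for `OLD` in 2000-04 shows up as a 0% return, and `NEW` stays NaN until the month after its first price, which is why later steps need to know each asset's first valid month.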

## Extracting the mean-variance analysis data

For the basic mean-variance analysis used by the app, we want to extract for each ticker its long name (or its short name if the long name is unavailable), its annualized mean monthly return, the annualized variance of its monthly returns, and its covariances with each of the other assets. To calculate the covariance between any two assets, the developer chose to first find the earliest date from which both assets have valid (non-NaN) data and then calculate the covariance only over the periods where both assets have data, with the intention that this may better capture the diversification that mean-variance analysis aims for. Note that this affects the accuracy of the data and may not be a standard calculation for the ex post Sharpe ratio. This choice is again subject to the disclaimer above (see the [Terms of Service](https://meanvarianceanalyzermain.gatsbyjs.io/terms)).

In [84]:
firstValidMonthDict = dict()
for ticker in assetTickers:
    firstValidMonthDict[ticker] = df[ticker].notna().idxmax()
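
As a small illustration of the overlapping-window idea, the sketch below computes one pairwise covariance on made-up returns. It uses `first_valid_index()` in place of the `notna().idxmax()` call above (the two should agree for a column with at least one valid value); the tickers and numbers are invented.

```python
import numpy as np
import pandas as pd

# Made-up monthly returns (%); "NEW" only has data from 2000-03 onward.
returns = pd.DataFrame(
    {
        "OLD": [1.0, -2.0, 3.0, 1.0, 5.0],
        "NEW": [np.nan, np.nan, 2.0, 0.0, 4.0],
    },
    index=["2000-01", "2000-02", "2000-03", "2000-04", "2000-05"],
)

# First month with a non-NaN return for each column.
first_valid = {col: returns[col].first_valid_index() for col in returns.columns}

# "YYYY-MM" strings sort chronologically, so max() picks the later start month.
start = max(first_valid["OLD"], first_valid["NEW"])

# Sample covariance over the months where both assets have data.
pair_cov = returns.loc[start:, ["OLD", "NEW"]].cov().iloc[0, 1]
print(start, pair_cov)
```

Since `DataFrame.cov` already excludes NaN values pairwise, the explicit slice mainly makes the chosen window visible in the code.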

Now we use the DataFrame of historical monthly returns to create a dictionary containing the annualized mean return and variance of each asset over its full available history, along with the covariance between each asset and every asset that comes later than it in alphabetical order, computed over the periods where both have valid data. The last asset in alphabetical order therefore stores no covariances of its own.

In [109]:
%%time
assetDataDict = dict()
for i in range(len(assetTickers)-1): # Calculate covariances for all except last asset (alphabetical order)
    assetVar = 12 * df[assetTickers[i]].std()**2 # Annualized %: multiply by 12 months/year
    assetMean = 12 * df[assetTickers[i]].mean()
    
    assetInfo = yf.Ticker(assetTickers[i]).info
    if 'longName' in assetInfo and assetInfo['longName']:
        assetTitle = assetInfo['longName']
    else:
        assetTitle = assetInfo['shortName']
        
    assetDataDict[assetTickers[i]] = {'title': assetTitle, 'annRetPct': assetMean, 
                                      'annVar': assetVar, 'cov': {}}
    
    for j in range(i+1, len(assetTickers)):
        startDate = max(firstValidMonthDict[assetTickers[i]], firstValidMonthDict[assetTickers[j]])
        assetCov = df[[assetTickers[i], assetTickers[j]]].loc[startDate:].cov().iloc[0, 1]
        assetDataDict[assetTickers[i]]['cov'][assetTickers[j]] = 12 * assetCov

# Calculate values for last asset (alphabetical order)
assetVar = 12 * df[assetTickers[-1]].std()**2
assetMean = 12 * df[assetTickers[-1]].mean()

assetInfo = yf.Ticker(assetTickers[-1]).info
if 'longName' in assetInfo and assetInfo['longName']:
    assetTitle = assetInfo['longName']
else:
    assetTitle = assetInfo['shortName']
        
assetDataDict[assetTickers[-1]] = {'title': assetTitle, 'annRetPct': assetMean, 
                                  'annVar': assetVar}

Wall time: 2min 7s
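
The factor of 12 in the cell above corresponds to the following scaling, written here in notation that is ours rather than the notebook's: $\bar{r}_i$ is the sample mean of asset $i$'s monthly returns (in %), $s_i^2$ their sample variance, and $s_{ij}$ the sample covariance between assets $i$ and $j$.

$$
\mu_i^{\text{ann}} = 12\,\bar{r}_i, \qquad
\left(\sigma_i^{\text{ann}}\right)^2 = 12\,s_i^2, \qquad
\sigma_{ij}^{\text{ann}} = 12\,s_{ij}
$$

This is a common simplification that treats returns in different months as uncorrelated and ignores compounding, so it should be read as an approximation rather than an exact conversion.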


In [117]:
print(assetDataDict['XMR-USD'])
print(assetDataDict['XRP-USD'])

{'title': 'Monero USD', 'annRetPct': 165.8368470714276, 'annVar': 37150.18758084561, 'cov': {'XOM': -75.66642286219667, 'XRP-USD': 17823.68418609792}}
{'title': 'XRP USD', 'annRetPct': 255.16571242423956, 'annVar': 129560.28442044015}


## Export dictionary as JSON file

Now we can export the dictionary with the data required to calculate the ex post Sharpe ratio as a JSON file to the data folder of Mean-Variance Analyzer.

In [111]:
with open("../mean-variance-analyzer/data/asset-data.json", "w") as f:
    json.dump(assetDataDict, f, indent=2)
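
To show how the exported structure could be consumed, here is a minimal sketch that computes a portfolio's annualized mean return, standard deviation, and ex post Sharpe ratio from data in the same shape. The two assets, their values, the helper names `pair_cov` and `portfolio_stats`, and the `risk_free_pct` parameter are all hypothetical, and this is not necessarily how the web app performs the calculation.

```python
import json
import math

# Made-up excerpt in the same shape as asset-data.json (values are illustrative).
asset_data = json.loads("""
{
  "AAA": {"title": "Asset A", "annRetPct": 10.0, "annVar": 400.0, "cov": {"BBB": 50.0}},
  "BBB": {"title": "Asset B", "annRetPct": 6.0, "annVar": 100.0}
}
""")

def pair_cov(data, a, b):
    """Look up cov(a, b); only the alphabetically earlier ticker stores it."""
    if a == b:
        return data[a]["annVar"]
    first, second = sorted((a, b))
    return data[first]["cov"][second]

def portfolio_stats(data, weights, risk_free_pct=0.0):
    """Return (mean %, std dev %, Sharpe ratio) for a dict of ticker -> weight."""
    mean = sum(w * data[t]["annRetPct"] for t, w in weights.items())
    # Portfolio variance is the double sum of w_a * w_b * cov(a, b).
    var = sum(
        wa * wb * pair_cov(data, a, b)
        for a, wa in weights.items()
        for b, wb in weights.items()
    )
    std = math.sqrt(var)
    return mean, std, (mean - risk_free_pct) / std

mean, std, sharpe = portfolio_stats(asset_data, {"AAA": 0.5, "BBB": 0.5})
print(mean, std, sharpe)
```

Because returns are stored in % and variances in %², the standard deviation comes out in % as well, and the Sharpe ratio is unitless.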