# ETF Sector Breakdowns

This project addresses the lack of insight available to Robinhood users who invest primarily in Exchange-Traded Funds (ETFs). Robinhood lets users visualize their portfolio by sector; however, this breakdown is unavailable when the portfolio is invested solely in ETFs.

Traditionally, a portfolio breakdown looks similar to the following example: Technology: 34%, Healthcare: 45%, Financials: 15%, Telecommunications: 6%. This gives the user a quick, high-level understanding of the sectors that make up their portfolio. For an account that holds only ETFs, the breakdown shows ETF: 100% and provides no further insight into the sector distribution.

## Workflow
* [1: Retrieving Robinhood Portfolio](#section1)
* [2: Webscraping ETF Sector Breakdown](#section2)
    * [2.1: Retrieving ETF Sector Breakdowns](#section2.1)
    * [2.2: Retrieving ETF Stock Holdings](#section2.2)
    * [2.3: Finding the corresponding sector for each stock](#section2.3)
* [3: Connecting to Google Sheets API](#section3)

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import matplotlib.pyplot as plt
import numpy as np
import cv2
from datetime import datetime
import time
from tabulate import tabulate
import robin_stocks.robinhood as r
import investpy
import gspread as gs
from df2gspread import df2gspread as d2g
from oauth2client.service_account import ServiceAccountCredentials
from datetime import date
from dateutil.relativedelta import relativedelta

In [2]:
pd.set_option("display.precision", 8)

In [3]:
starttime = datetime.now()

<a class="anchor" id="section1"></a>

# Retrieving Robinhood Portfolio

The code below retrieves login information from an encrypted file and then uses robin_stocks to access the Robinhood portfolio. As a security measure, if the login information cannot be read, a webcam photo is taken and saved locally under a file name composed of the timestamp of the failed attempt and the file path that was requested.
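
As a minimal sketch of the file layout this step relies on, the snippet below parses a four-line login file into named fields. The order of the lines (username, password, spreadsheet key, Google API key file path) is taken from the notebook's own login cell; the `LoginInfo` class and the `read_login_file` helper are hypothetical names introduced for illustration, and the sketch reads plain text, so any decryption step is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class LoginInfo:
    """Hypothetical container for the four values stored in the login file."""
    username: str
    password: str
    spread_key: str
    gdapi: str


def read_login_file(path):
    """Parse a login file with one value per line (hypothetical helper)."""
    # splitlines() drops line endings, so a missing final newline is harmless.
    lines = [line.strip() for line in Path(path).read_text().splitlines()]
    if len(lines) < 4:
        raise ValueError("expected 4 lines, found " + str(len(lines)))
    return LoginInfo(*lines[:4])
```

Stripping each line, rather than slicing off its last character, keeps the final value intact when the file does not end with a newline.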

In [4]:
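import os
path = input("Enter login file path: ")
def capture(file):
    """Save a webcam photo of a failed login attempt under Captures/."""
    oldtime = time.time()
    cap = cv2.VideoCapture(0)
    ct = datetime.now()
    # Keep only the base name so a path with separators still yields a valid file name.
    file = 'Captures/' + str(ct.timestamp()) + '_' + os.path.basename(file) + '_.jpg'
    frame = None
    # Read frames for about four seconds and keep the most recent valid one.
    while oldtime + 4 >= time.time():
        ret, new_frame = cap.read()
        if ret:
            frame = new_frame
    if frame is not None:
        os.makedirs('Captures', exist_ok=True)
        cv2.imwrite(file, frame)
    cap.release()
    cv2.destroyAllWindows()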

Enter login file path: login.txt


In [5]:
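try:
    with open(path) as f:
        # One value per line: username, password, spreadsheet key, API key file path.
        lines = [line.strip() for line in f]
    username, password, spread_key, gdapi = lines[:4]
except (OSError, ValueError):
    capture(path)
    # Stop here: the cells below cannot run without the login values.
    raise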

In [6]:
r.login(username, password)
portfolio = r.build_holdings()

In [7]:
df = pd.DataFrame(portfolio)
df = df.transpose()
df = df.reset_index()
df.rename(columns = {'index':'ticker'}, inplace = True)
robinhood_portfolio = df
robinhood_portfolio.head()

| | ticker | price | quantity | average_buy_price | equity | percent_change | intraday_percent_change | equity_change | type | name | id | pe_ratio | percentage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | QQQ | 321.66 | 0.293392 | 340.8409 | 94.37 | -5.63 | 0.0 | -5.627523 | etp | Invesco QQQ | 1790dd4f-a7ff-409e-90de-cad5efafde10 | 26.211685 | 13.38 |
| 1 | VTWO | 77.82 | 1.306165 | 76.56 | 101.65 | 1.65 | 0.0 | 1.645768 | etp | Vanguard Russell 2000 Index Fund | 372bc2fa-2916-4a5b-acbd-40b135828931 | 41.24056 | 14.41 |
| 2 | SPY | 414.12 | 0.225154 | 444.1405 | 93.24 | -6.76 | 0.0 | -6.759236 | etp | SPDR S&P 500 ETF | 8f92e76f-1e0e-4478-8580-16a6ffcfaef5 | 20.71372 | 13.22 |
| 3 | SOXX | 413.6 | 0.212397 | 470.8164 | 87.85 | -12.15 | 0.0 | -12.152592 | etp | iShares Semiconductor ETF | 4b298506-cbb8-492e-807b-23bb4787378a | 20.205214 | 12.46 |
| 4 | XBI | 93.2304 | 1.200336 | 83.31 | 111.91 | 11.91 | 0.0 | 11.907813 | etp | SPDR S&P Biotech ETF | 2d764d73-9ad8-4cf5-9a2b-443d58d9a2b5 | -6.773857 | 15.87 |


In [8]:
r.logout()

<a class="anchor" id="section2"></a>

# Webscraping ETF data

Now that the Robinhood portfolio has been accessed, the ETFs' data is scraped from marketwatch.com and stored for each ticker.

<a class="anchor" id="section2.1"></a>
## Retrieving ETF Sector Breakdowns

### Adding sector dictionary to standardize sector names across multiple DFs
Updates to the dictionary can be made on a case-by-case basis as new key errors arise with the expansion of the portfolio. Because all of the web data is scraped from marketwatch.com, the dictionary should eventually capture every variety of sector name, after which updates will no longer be needed.
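
Since the two dictionaries defined for this step hold the same information in opposite directions, one could be derived from the other so that only a single mapping needs updating. The sketch below shows that idea on a shortened copy of the mapping. `invert_sector_map` and `standardize` are hypothetical helpers, `'Aerospace'` is an invented unmapped name, and the `'Unmapped'` fallback label is an assumption rather than part of the project.

```python
def invert_sector_map(sector_as_key):
    """Build {scraped name: standard name} from {standard name: [scraped names]}."""
    sector_as_value = {}
    for standard_name, scraped_names in sector_as_key.items():
        for scraped_name in scraped_names:
            sector_as_value[scraped_name] = standard_name
    return sector_as_value


def standardize(scraped_name, sector_as_value, unknown):
    """Look up a scraped name; record it in `unknown` instead of raising KeyError."""
    if scraped_name in sector_as_value:
        return sector_as_value[scraped_name]
    unknown.add(scraped_name)
    return 'Unmapped'  # assumed fallback label, not used by the original notebook


# Shortened copy of the mapping, for illustration only.
sector_as_key = {
    'Information Technology': ['Technology', 'Telecommunications'],
    'Energy': ['Oil & Gas', 'Companies on the Energy Service'],
}
sector_as_value = invert_sector_map(sector_as_key)
unknown_names = set()
print(standardize('Oil & Gas', sector_as_value, unknown_names))
print(standardize('Aerospace', sector_as_value, unknown_names))
print(unknown_names)
```

Collecting unknown names in a set makes it possible to add all missing entries in one pass instead of fixing one key error at a time.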

In [9]:
real_sector_as_key = {'Information Technology': ['Technology', 'Telecommunications'], 
                      'Health Care': ['Health Care', 'Health Care/Life Sciences'], 
                      'Financials': ['Financials', 'Financial Services'], 
                      'Consumer Discretionary': ['Automotive', 'Consumer Services', 'Business/Consumer Services'], 
                      'Communication Services': ['Communication Services'], 
                      'Industrials': ['Industrials', 'Industrial Goods'], 
                      'Consumer Staples': ['Consumer Goods'], 
                      'Energy': ['Oil & Gas', 'Companies on the Energy Service'], 
                      'Utilities': ['Utilities', 'Retail/Wholesale'], 
                      'Real Estate': ['Real Estate/Construction'], 
                      'Materials': ['Basic Materials'],
                      'NaN': [np.nan, 'N/A', 'Non Classified Equity']}

real_sector_as_value = {'Technology': 'Information Technology',
                        'Telecommunications': 'Information Technology',
                        'Health Care': 'Health Care',
                        'Health Care/Life Sciences': 'Health Care',
                        'Financials': 'Financials',
                        'Financial Services': 'Financials',
                        'Automotive': 'Consumer Discretionary',
                        'Business/Consumer Services': 'Consumer Discretionary', 
                        'Consumer Services': 'Consumer Discretionary',
                        'Communication Services': 'Communication Services',
                        'Industrials': 'Industrials',
                        'Industrial Goods': 'Industrials',
                        'Consumer Goods': 'Consumer Staples',
                        'Oil & Gas': 'Energy',
                        'Companies on the Energy Service': 'Energy',
                        'Utilities': 'Utilities',
                        'Retail/Wholesale': 'Utilities',
                        'Real Estate/Construction': 'Real Estate',
                        'Basic Materials': 'Materials',
                         np.nan: 'NaN',
                        'N/A': 'NaN',
                        'Non Classified Equity': 'NaN'}

In [10]:
def scrape_etf(ticker):
    """Return (ticker, names, percentages) scraped from the ETF's MarketWatch holdings page."""
    url = "https://www.marketwatch.com/investing/fund/" + ticker + "/holdings"
    response = requests.get(url)
    # Start with empty lists so the return below is defined even when the request fails.
    sector_percent_list = []
    percentage_list = []

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        sectors = soup.find_all("td", {"class": "table__cell w75"})
        for sector in sectors:
            sector_percent_list.append(sector.text)
        percentage = soup.find_all("td", {"class": "table__cell w25"})
        for curr_percent in percentage:
            percentage_list.append(curr_percent.text)
    return ticker, sector_percent_list, percentage_list

In [11]:
sector_dict = {}
sum1 = 0
for index in range(len(df)):
    # Share of the whole portfolio held in this ETF, rounded to four decimals.
    ticker_percent_of_portfolio = round(float(df['percentage'][index]) / 100, 4)
    ticker = df['ticker'][index].lower()
    data = scrape_etf(ticker)
    sector_name_list = data[1]
    # Keep only the entries that precede the 'Stocks' entry.
    sector_name_list = sector_name_list[:sector_name_list.index('Stocks')]
    sector_breakdown_list = data[2]
    # Use a separate loop variable so the outer index is not overwritten.
    for sector_index in range(len(sector_name_list)):
        # Parse the percentage with float() instead of eval().
        percentage = float(sector_breakdown_list[sector_index].replace('%', ''))
        percentage = round(percentage / 100, 4)
        # Weight the ETF's sector share by the ETF's share of the portfolio.
        percentage = percentage * ticker_percent_of_portfolio
        sum1 += percentage
        sector_name = sector_name_list[sector_index]
        sector_dict[sector_name] = sector_dict.get(sector_name, 0) + percentage
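
The arithmetic in the loop above can be illustrated without any scraping. The sketch below uses two made-up ETFs (the tickers `AAA` and `BBB` and all weights are invented for illustration) and multiplies each ETF's sector share by the ETF's share of the portfolio before summing per sector. `roll_up_sectors` is a hypothetical helper, not part of the notebook.

```python
def roll_up_sectors(portfolio_weights, etf_sector_weights):
    """Combine per-ETF sector shares into portfolio-level sector shares."""
    totals = {}
    for etf, portfolio_weight in portfolio_weights.items():
        for sector, sector_weight in etf_sector_weights[etf].items():
            # An ETF contributes its sector share scaled by its portfolio share.
            contribution = portfolio_weight * sector_weight
            totals[sector] = totals.get(sector, 0.0) + contribution
    return totals


# Invented example data, expressed as fractions rather than percentages.
portfolio_weights = {'AAA': 0.60, 'BBB': 0.40}
etf_sector_weights = {
    'AAA': {'Technology': 0.50, 'Health Care': 0.50},
    'BBB': {'Technology': 0.25, 'Financials': 0.75},
}

for sector, weight in roll_up_sectors(portfolio_weights, etf_sector_weights).items():
    print(f"{sector}: {weight:.2%}")
```

Here Technology comes to 0.60 × 0.50 + 0.40 × 0.25 = 0.40, while Health Care and Financials come to 0.30 each, so the three shares add up to the whole portfolio.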

In [12]:
sector_dict

{'Technology': 0.4071992200000001,
 'Consumer Services': 0.048583180000000004,
 'Consumer Goods': 0.03452028,
 'Health Care': 0.33779383,
 'Industrials': 0.050009370000000004,
 'Telecommunications': 0.00400466,
 'Utilities': 0.01043227,
 'Non Classified Equity': 0.025070070000000003,
 'Basic Materials': 0.00838495,
 'Financials': 0.055762019999999995,
 'Oil & Gas': 0.012133270000000002}

In [13]:
percent_total = 0
for sector, percent in sector_dict.items():
    percent_total += percent
sectors_list = []
percents_list = []
for sector, percent in sector_dict.items():
    sectors_list.append(real_sector_as_value[sector])
    percents_list.append(percent / percent_total)

In [14]:
portfolio_sector_breakdown_df = pd.DataFrame(list(zip(sectors_list, percents_list)),
               columns =['Sector', 'Percent_of_Portfolio'])
portfolio_sector_breakdown_df.head(10)

| | Sector | Percent_of_Portfolio |
|---|---|---|
| 0 | Information Technology | 0.40970122 |
| 1 | Consumer Discretionary | 0.04888169 |
| 2 | Consumer Staples | 0.03473239 |
| 3 | Health Care | 0.33986937 |
| 4 | Industrials | 0.05031665 |
| 5 | Information Technology | 0.00402927 |
| 6 | Utilities | 0.01049637 |
| 7 | NaN | 0.02522411 |
| 8 | Materials | 0.00843647 |
| 9 | Financials | 0.05610464 |
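
Because Technology and Telecommunications both map to Information Technology, that sector appears twice in the table above (rows 0 and 5). A possible follow-up step, sketched here under the assumption that one row per standardized sector is wanted, sums the duplicate rows. The helper name `combine_duplicate_sectors` is hypothetical, and the input repeats only three of the rows shown above.

```python
from collections import defaultdict


def combine_duplicate_sectors(rows):
    """Sum the percentages of rows that share a standardized sector name."""
    totals = defaultdict(float)
    for sector, percent in rows:
        totals[sector] += percent
    return dict(totals)


# Three of the rows from the table above, as (sector, share) pairs.
rows = [
    ('Information Technology', 0.40970122),
    ('Health Care', 0.33986937),
    ('Information Technology', 0.00402927),
]
combined = combine_duplicate_sectors(rows)
print(combined)
```

On the DataFrame itself, `portfolio_sector_breakdown_df.groupby('Sector', as_index=False)['Percent_of_Portfolio'].sum()` would produce the same totals.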


<a class="anchor" id="section2.2"></a>
## Retrieving ETF stock holdings data

In [15]:
def scrape_etf_holdings(ticker):
    """Return ({stock ticker: company name}, {stock ticker: portfolio weight}) for one ETF."""
    url = "https://www.marketwatch.com/investing/fund/" + ticker + "/holdings"
    response = requests.get(url)
    # Share of the whole portfolio held in this ETF.
    ticker_percent = float(df.loc[df['ticker'] == ticker.upper(), 'percentage'].iloc[0]) / 100
    # Define the results up front so the return is valid even when the request fails.
    name_to_ticker = {}
    ticker_percent_holding = {}

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        tables = soup.find_all("tr", {"class": "table__row"})
        scraped_text = [row.text for row in tables]
        etf_holdings = []
        attach = 0
        for text in scraped_text:
            # Holdings rows follow the second row that contains 'Company'.
            if 'Company' in text:
                attach += 1
                continue
            if attach == 2:
                etf_holdings.append(text.split())
        for holding in etf_holdings:
            try:
                name = ' '.join(holding[:-2])
                # Use a new name so the ETF ticker argument is not overwritten.
                stock = holding[-2]
                percent_holding = round(float(holding[-1].replace('%', '')) / 100, 4)
            except (IndexError, ValueError):
                # Skip rows that do not end with a ticker and a percentage.
                continue
            name_to_ticker[stock] = name
            ticker_percent_holding[stock] = percent_holding * ticker_percent
    return name_to_ticker, ticker_percent_holding
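
The row parsing inside `scrape_etf_holdings` relies on each holdings row ending with a ticker and a percentage once it is split on whitespace. The following is a minimal sketch of that step, run on hand-written row text rather than a live page. The sample strings (including their percentages) and the `parse_holding_row` helper are assumptions made for illustration, and the percentage is converted with `float`.

```python
def parse_holding_row(row_text):
    """Split 'Company Name TICKER 1.23%' into (name, ticker, fraction).

    Returns None when the row does not end with a ticker and a percentage.
    """
    parts = row_text.split()
    if len(parts) < 3:
        return None
    name = ' '.join(parts[:-2])
    ticker = parts[-2]
    try:
        fraction = float(parts[-1].replace('%', '')) / 100
    except ValueError:
        return None
    return name, ticker, fraction


# Invented sample rows; the percentages are not real holdings data.
print(parse_holding_row('Apple Inc. AAPL 12.50%'))
print(parse_holding_row('Alphabet Inc. Cl C GOOG 3.75%'))
print(parse_holding_row('Cash and equivalents'))
```

Taking the last two tokens as ticker and percentage lets company names contain any number of words, at the cost of misreading rows that lack a ticker column, which is why unparseable rows are skipped.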

In [16]:
master_holdings = {}
company_names = {}

In [17]:
def add_holding(ticker):
    # Scrape once and unpack both results to avoid a second request for the same page.
    name_to_ticker, ticker_percent_holding = scrape_etf_holdings(ticker)
    for stock, percentage in ticker_percent_holding.items():
        # Sum the weights of stocks that are held through more than one ETF.
        master_holdings[stock] = master_holdings.get(stock, 0) + percentage
    for stock, name in name_to_ticker.items():
        company_names[stock] = name
for index in range(len(df)):
    etf = df['ticker'][index]
    add_holding(etf.lower())

## Creating Stock Ticker to Name DataFrame

In [18]:
company_list = []
ticker_list = []
for ticker, name in company_names.items():
    ticker_list.append(ticker)
    company_list.append(name)
company_names_df = pd.DataFrame(list(zip(ticker_list, company_list)))
company_names_df.columns = ['Ticker', 'Company']
company_names_df.head()

| | Ticker | Company |
|---|---|---|
| 0 | AAPL | Apple Inc. |
| 1 | MSFT | Microsoft Corp. |
| 2 | AMZN | Amazon.com Inc. |
| 3 | TSLA | Tesla Inc. |
| 4 | GOOG | Alphabet Inc. Cl C |


In [19]:
ticker_list = []
ticker_percent_list = []
for ticker, percent in master_holdings.items():
    ticker_list.append(ticker)
    ticker_percent_list.append(percent)
master_holdings_df = pd.DataFrame(list(zip(ticker_list, ticker_percent_list)))
master_holdings_df.columns = ['Stock', 'Percent_of_Portfolio']

<a class="anchor" id="section2.3"></a>
## Finding the corresponding sector for each stock

In [20]:
def scrape_stock_sector(ticker):
    """Return the sector listed on the stock's MarketWatch profile, or NaN if not found."""
    url = "https://www.marketwatch.com/investing/stock/" + ticker + "/company-profile?mod=mw_quote_tab"
    response = requests.get(url)
    # Fall back to NaN so the caller can still look the result up in real_sector_as_value.
    sector = np.nan

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        info = soup.find_all("li", {"class": "kv__item w100"})
        for text in info:
            text = text.text
            if 'Sector' in text:
                # Drop the leading 'Sector' label and keep the rest as the sector name.
                return ' '.join(text.split()[1:])
    return sector

In [21]:
sector_list = []
for ticker in ticker_list:
    sector_list.append(real_sector_as_value[scrape_stock_sector(ticker.lower())])

In [22]:
stock_sectors_df = pd.DataFrame(list(zip(ticker_list, sector_list)))
stock_sectors_df.columns = ['Ticker', 'Sector']
stock_sectors_df.head()

| | Ticker | Sector |
|---|---|---|
| 0 | AAPL | Information Technology |
| 1 | MSFT | Information Technology |
| 2 | AMZN | Utilities |
| 3 | TSLA | Consumer Discretionary |
| 4 | GOOG | Information Technology |


## Checking for standardization in sector names across multiple DFs

In [23]:
portfolio_sector_breakdown_df.head()

| | Sector | Percent_of_Portfolio |
|---|---|---|
| 0 | Information Technology | 0.40970122 |
| 1 | Consumer Discretionary | 0.04888169 |
| 2 | Consumer Staples | 0.03473239 |
| 3 | Health Care | 0.33986937 |
| 4 | Industrials | 0.05031665 |


In [24]:
stock_sectors_df.head()

| | Ticker | Sector |
|---|---|---|
| 0 | AAPL | Information Technology |
| 1 | MSFT | Information Technology |
| 2 | AMZN | Utilities |
| 3 | TSLA | Consumer Discretionary |
| 4 | GOOG | Information Technology |


In [25]:
all_sectors_found = []
for sector in portfolio_sector_breakdown_df.Sector:
    if sector not in all_sectors_found:
        all_sectors_found.append(sector)
for sector in stock_sectors_df.Sector:
    if sector not in all_sectors_found:
        all_sectors_found.append(sector)
all_sectors_found

['Information Technology',
 'Consumer Discretionary',
 'Consumer Staples',
 'Health Care',
 'Industrials',
 'Utilities',
 'NaN',
 'Materials',
 'Financials',
 'Energy',
 'Real Estate']
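
The same check can be expressed as a set comparison, which also reports any name that falls outside the standardized list. In this sketch the expected names are the keys of `real_sector_as_key` from earlier in the notebook, typed out again so the snippet runs on its own. `find_unexpected_sectors` is a hypothetical helper, and the two short input lists are invented sample data.

```python
EXPECTED_SECTORS = {
    'Information Technology', 'Health Care', 'Financials',
    'Consumer Discretionary', 'Communication Services', 'Industrials',
    'Consumer Staples', 'Energy', 'Utilities', 'Real Estate',
    'Materials', 'NaN',
}


def find_unexpected_sectors(*sector_columns):
    """Return the sector names that are not in EXPECTED_SECTORS, sorted."""
    found = set()
    for column in sector_columns:
        found.update(column)
    return sorted(found - EXPECTED_SECTORS)


breakdown_sectors = ['Information Technology', 'Health Care', 'NaN']
stock_sectors = ['Utilities', 'Technology']  # 'Technology' is deliberately unmapped
print(find_unexpected_sectors(breakdown_sectors, stock_sectors))
```

An empty result means every sector name in both DataFrames has been standardized.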

# Using InvestPy to retrieve market history for the top held stocks

In [26]:
def get_stock_history(ticker):
    today = date.today()
    # investpy expects dates formatted as dd/mm/yyyy.
    current_day = today.strftime("%d/%m/%Y")
    ten_years_ago = (today - relativedelta(years=10)).strftime("%d/%m/%Y")
    try:
        df = investpy.get_stock_historical_data(stock=ticker.upper(),
                                                country='United States',
                                                from_date=ten_years_ago,
                                                to_date=current_day)
        return df, ticker
    except Exception:
        # Return None when no history can be retrieved; callers must handle that case.
        return None

<a class="anchor" id="section3"></a>
# Connecting to Google Sheets API

All DataFrames are now written to a Google Sheets spreadsheet that serves as a live connection for a Tableau dashboard. The Google Sheets and Google Drive API keys are retrieved from an encrypted .json file.

In [27]:
import gspread as gs
from df2gspread import df2gspread as d2g
from oauth2client.service_account import ServiceAccountCredentials

In [28]:
scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
credentials = ServiceAccountCredentials.from_json_keyfile_name(
    gdapi, scope)
gc = gs.authorize(credentials)

In [29]:
wks_name1 = "Sector_Breakdown"
wks_name2 = "Stock_Holdings"
wks_name3 = "Company Names"
wks_name4 = "Ticker and Sector"
df1 = portfolio_sector_breakdown_df
df2 = master_holdings_df
df3 = company_names_df
df4 = stock_sectors_df
wks = d2g.upload(df=df1, gfile=spread_key, wks_name=wks_name1, credentials=credentials, row_names=False)
wks = d2g.upload(df=df2, gfile=spread_key, wks_name=wks_name2, credentials=credentials, row_names=False)
wks = d2g.upload(df=df3, gfile=spread_key, wks_name=wks_name3, credentials=credentials, row_names=False)
wks = d2g.upload(df=df4, gfile=spread_key, wks_name=wks_name4, credentials=credentials, row_names=False)

## Retrieving top holdings market history and writing to unique sheet

In [30]:
def upload_stock_data(ticker):
    try:
        wks_name = ticker
        df = get_stock_history(ticker)[0]
        wks = d2g.upload(df=df, gfile=spread_key, wks_name=wks_name, credentials=credentials, row_names=False)
    except Exception:
        # Skip tickers whose history could not be retrieved or uploaded.
        pass

The upload_stock_data loop is wrapped in a function so that the kernel can be restarted and run without the loop executing automatically.

In [31]:
def start():
    x = input('Retrieve current market history for top ' + str(len(ticker_list)) + " stocks held? (Yes/No)")
    if x.lower() == 'yes':
        for ticker in ticker_list:
            upload_stock_data(ticker)

# Tableau Visualizations

In [32]:
%%html
<div class='tableauPlaceholder' id='viz1660003454924' style='position: relative'><noscript><a href='#'><img alt='Dashboard 1 ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;ET&#47;ETFPortfolioDash&#47;Dashboard1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='ETFPortfolioDash&#47;Dashboard1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;ET&#47;ETFPortfolioDash&#47;Dashboard1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1660003454924');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.width='1000px';vizElement.style.height='827px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='1000px';vizElement.style.height='827px';} else { vizElement.style.width='100%';vizElement.style.height='727px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

## Runtime statistics

In [33]:
finishtime = datetime.now()


In [37]:
elapsed_time = finishtime - starttime
print("Total runtime: ", elapsed_time)

Total runtime:  0:00:46.150623
