# Financial Screen Scraping

Average investors are constantly taken in by get-rich-quick schemes or the well-meaning advice of friends. Often lacking financial literacy, they're prone to falling for a good story without doing rigorous analysis of the securities involved.

The problems with rigorous analysis:
1. It's hard! 
2. It takes too much time! 
3. It's expensive to acquire data from financial institutions!

All of these things are true, which is why we'll use Python and pandas to scrape the web for the needed numbers, apply them to a market-tested model, and export the output to Google Sheets! The idea for this project came from a client who was doing these calculations manually and wanted to automate his market analysis. He's been kind enough to allow me to share this with the open source community.

## Dependencies

Before getting started, make sure your environment has all the modules listed below. Any you don't have can be installed with a simple `pip install module_name` command (for instance, `pip install pygsheets`). Note that `pd.read_html()` also needs an HTML parser such as lxml installed.

This contains everything you'll need to get up and running harvesting financial data!

In [2]:
import pandas as pd #data processing
from tqdm import tqdm #progress bars
import pygsheets #export to Google Sheets
import numpy as np #numerical processing
import datetime #enable datetime manipulation
pd.set_option('display.max_columns', 500) #more columns makes it easier to work with wide datasets
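
If you'd rather check for missing modules before running the imports, here's a minimal sketch using only the standard library. The `required` list is my own assumption: it's the modules imported above plus lxml, the parser `pd.read_html()` tries first.

```python
import importlib.util

# Modules imported above, plus lxml, which pd.read_html() tries first when parsing HTML
required = ['pandas', 'tqdm', 'pygsheets', 'numpy', 'lxml']

# find_spec returns None when a top-level module isn't installed
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print('Install these first: pip install ' + ' '.join(missing))
else:
    print('All dependencies found')
```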

## The Model

You can work with whatever model your data will permit you to access. We're using a version of [Discounted Cash Flow](https://www.valuespreadsheet.com/free-discounted-cash-flow-calculator-spreadsheet) (DCF) designed by Nick Kraackman, founder of [Value Spreadsheet](https://www.valuespreadsheet.com/). This is a value-investing strategy designed to evaluate long-term plays by valuing a company in terms of forecasted cash flows. A value-investing strategy is ideal for this type of algorithm, as screen scraping will naturally be much slower than high-frequency trading (HFT) algos.

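His formula for DCF, written the way the `npv_fcf` function further down applies it, is as follows:

$$DCF = fcf_y \cdot m + \sum_{i=1}^{y} fcf_i, \qquad fcf_i = \frac{fcf_{i-1}\,(1 + g_{i-1})}{1 + r}, \qquad g_i = g_{i-1}\,(1 - gd)$$

Here $fcf_0$ is the most recently reported free cash flow and $g_0$ is the starting growth rate. The function then adds cash on hand and subtracts long-term debt to arrive at the value of the equity.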



|Variable|Term|Found On|
|-----|-----|-----|
|DCF|Discounted Cash Flow|Calculated|
|fcf     |Free Cash Flow|Cash Flow Statement|
|m|Multiplier|User Defined|
|i|Current Year|Within Calculation|
|y|Max Year|User Defined|
|g|Growth Rate|User Defined|
|gd|Growth Decline Rate|User Defined|
|r|Discount Rate|User Defined|

The "User Defined" variables are assumptions the investor has to make about the time value of money, but **we'll use the numbers that Nick Kraackman suggested in his article** to get started.
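
To make the year-by-year mechanics concrete, here's a hand-sized sketch that mirrors the loop in `npv_fcf`. All of the inputs are made-up numbers chosen for illustration, and it leaves out cash on hand and debt.

```python
# Hypothetical inputs chosen only for illustration
fcf = 100.0            # most recent free cash flow
growth = 0.10          # starting growth rate (g)
growth_decline = 0.05  # growth decline rate (gd)
discount = 0.10        # discount rate (r)
years = 3              # max year (y)
multiplier = 12        # multiplier (m)

projected = []
for year in range(1, years + 1):
    fcf = fcf * (1 + growth) / (1 + discount)  # grow one year, then discount it back
    projected.append(fcf)
    print(f'year {year}: growth {growth:.4f}, discounted fcf {fcf:.2f}')
    growth = growth * (1 - growth_decline)     # the growth rate decays each year

# Sum of the yearly values plus the final year times the multiplier
dcf = sum(projected) + projected[-1] * multiplier
print(f'DCF: {dcf:.2f}')
```

Because the starting growth rate equals the discount rate here, the first year comes out flat, and each later year shrinks slightly as the growth rate decays.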

The function below models this process in Python and can be called on its own or within another function. 

In [3]:
def npv_fcf(fcf, discount_rate, years, growth_rate, multiplier, growth_decline_rate, cash_on_hand, total_debt):
    '''Calculates the net present value of free cash flow over a given number of years,
    plus a terminal value and cash on hand, minus total debt'''
    
    npv = [] #collect each year's discounted free cash flow
    
    #Grow and discount the free cash flow one year at a time, letting the growth rate decay
    for year in range(1, years+1):
        fcf = fcf * (1 + growth_rate) / (1 + discount_rate)
        growth_rate = growth_rate * (1 - growth_decline_rate)
        npv.append(fcf)
    
    npv.append(fcf * multiplier) #terminal value: final-year free cash flow times the multiplier
    npv.append(cash_on_hand) #cash adds to the value of the equity
    npv.append(-total_debt) #debt subtracts from it
    
    return sum(npv)
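
Here's a quick way to get a feel for the function's behavior. The sketch below uses a condensed copy of `npv_fcf` (renamed `npv_fcf_sketch`, my own name, so the snippet runs on its own) with hypothetical dollar figures. It also shows that passing pandas Series works element by element.

```python
import pandas as pd

def npv_fcf_sketch(fcf, discount_rate, years, growth_rate, multiplier,
                   growth_decline_rate, cash_on_hand, total_debt):
    '''Condensed copy of npv_fcf so this snippet runs on its own'''
    total = 0
    for _ in range(years):
        fcf = fcf * (1 + growth_rate) / (1 + discount_rate)
        growth_rate = growth_rate * (1 - growth_decline_rate)
        total = total + fcf
    return total + fcf * multiplier + cash_on_hand - total_debt

# Hypothetical figures, in dollars
low_rate = npv_fcf_sketch(1_000_000, 0.08, 10, 0.06, 12, 0.05, 250_000, 400_000)
high_rate = npv_fcf_sketch(1_000_000, 0.12, 10, 0.06, 12, 0.05, 250_000, 400_000)
print(low_rate > high_rate)  # a higher discount rate shrinks the valuation

# Series inputs are processed element by element, one value per company
fcf = pd.Series([1_000_000, 2_000_000])
growth = pd.Series([0.06, 0.03])
cash = pd.Series([250_000, 0])
debt = pd.Series([400_000, 100_000])
values = npv_fcf_sketch(fcf, 0.10, 10, growth, 12, 0.05, cash, debt)
print(values.round(0))
```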

## Ingesting the Data to Feed the Model

The web scraping begins here. We grab, clean, and manipulate the needed data with pandas, a popular library data scientists use for cleanup.

It's a big function, but what it's doing is simple. 

1. It takes a dictionary of stock ticker symbols paired with their company names and categories.
2. For each ticker, it requests several Yahoo Finance pages and searches them for the needed data elements.
    1. Try replacing the curly brackets in the URLs with the `AAPL` ticker and browsing to those pages yourself to get a sense of the data it's reading.
3. It relies on `pd.read_html()`, which returns the tables on an HTML page as a list of DataFrames; that's why you see indexing such as `[0]` after those calls.

The function finds, cleans, and calculates the DCF for each stock, displaying a progress bar as it downloads from the web. Its length comes from having to pick and clean so many individual data points from different locations.
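
To see what `pd.read_html()` hands back before pointing it at a live page, here's a small sketch that parses an inline HTML string. The table contents are made-up stand-ins, not real quotes, and it assumes an HTML parser such as lxml is installed.

```python
import io
import pandas as pd

# A tiny stand-in page with two tables (made-up values, not real quotes)
html = '''
<table>
  <tr><td>Previous Close</td><td>150.00</td></tr>
  <tr><td>Open</td><td>151.25</td></tr>
</table>
<table>
  <tr><td>Market Cap</td><td>2.5T</td></tr>
  <tr><td>PE Ratio (TTM)</td><td>28.10</td></tr>
</table>
'''

tables = pd.read_html(io.StringIO(html))  # one DataFrame per <table> element
print(len(tables))  # number of tables found on the page
print(tables[0])    # the first table, selected by list position
```

Each table comes back as its own DataFrame with labels in the first column and values in the second, which is the layout the function transposes.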

In [4]:
def trackr(stocks, margin_of_safety=.25, discount_rate=.1, growth_decline=.05, years=10, multiplier=12, yahoo_adjust=1000):
    '''trackr reads a dictionary of stocks and their associated company names and categories and returns today's Yahoo Finance 
    stats on the corresponding stock as a row of values that can be appended to a dataframe'''
    
    #Error Handling
    if not isinstance(stocks, dict):
        print('Error. You must enter these in a dictionary format, where the symbol is the key, name is first list value, and the subsequent list values are categories')
        return
    
    tickers = list(stocks.keys()) #the ticker symbols are the dictionary keys
    quotes = [] #capture each entry as a row
    
    for ticker in tqdm(tickers):
        #Grab appropriate data based on tickers. This stores the appropriate URLs and then pulls data from the pages of Yahoo Finance
        
        quote = 'https://finance.yahoo.com/quote/{}?p={}'.format(ticker, ticker) #Quote string
        cashflow = 'https://finance.yahoo.com/quote/{}/cash-flow?p={}'.format(ticker, ticker) #Cashflow string
        stat = 'https://finance.yahoo.com/quote/{}/key-statistics?p={}'.format(ticker, ticker) #Key Statistics String
        balance = 'https://finance.yahoo.com/quote/{}/balance-sheet?p={}&.tsrc=fin-srch'.format(ticker, ticker) #Balance Sheet String
        analysis = 'https://finance.yahoo.com/quote/{}/analysis?p={}'.format(ticker, ticker) #Analysis String
        quote_data = pd.read_html(quote) #grab the data that YAHOO has on the given stock
        cashflow_data = pd.read_html(cashflow) #grab the cashflow statement
        stat_data = pd.read_html(stat) #grab key stats
        balance_data = pd.read_html(balance) #grab balance sheet
        analysis_data = pd.read_html(analysis) #grab analysis data

        #The HTML tables parse into a list. It will take some cleanup to prepare the list for a dataframe
        #stock 1 gives us Previous Close, Open, Bid, Ask, Day's Range, 52 Week Range, Volume, and Avg. Volume
        #stock 2 gives us Market Cap, Beta (3Y Monthly), PE Ratio (TTM), EPS (TTM), Earnings Date, Forward Dividend & Yield, Ex-Dividend Date, 1y Target Est

        stock1 = quote_data[0].transpose() #Transpose data into meaningful arrangement.
        stock1.columns = stock1.iloc[0] #set new header row
        stock1.drop(0, inplace=True) #drop the old

        stock2 = quote_data[1].transpose() #Transpose data into meaningful arrangement
        stock2.columns = stock2.iloc[0] #set new header row
        stock2.drop(0, inplace=True) #drop the old

        #combine them
        stock_cat = pd.concat([stock1, stock2], axis=1)
        
        #grab cashflow data for discounted cashflow calculation
        stock_cat['Cash Flow'] = np.where(cashflow_data[0].iloc[9][1] == '-', 
                                          cashflow_data[0].iloc[9][2],
                                          cashflow_data[0].iloc[9][1])
        stock_cat['CapEx'] = np.where(cashflow_data[0].iloc[11][1] == '-', 
                                          cashflow_data[0].iloc[11][2],
                                          cashflow_data[0].iloc[11][1])
        

        #add them up for Free Cash Flow calculation (CapEx is reported as a negative number, so adding it subtracts the spending)
        stock_cat['Free Cash Flow'] = int(stock_cat['Cash Flow'].iloc[0]) + int(stock_cat['CapEx'].iloc[0])
        
        # Adds cash and cash equivalents with short term investments. Discards blank values that yahoo returns and goes back a year if needed
        
        stock_cat['Cash On Hand'] = np.where(balance_data[0].iloc[2,1] == '-',
                                             int(balance_data[0].iloc[2,2]) + int(balance_data[0].iloc[3,2].replace('-', '0')),
                                             int(balance_data[0].iloc[2,1].replace('-', '0')) + int(balance_data[0].iloc[3,1].replace('-', '0')))
        
        #grabs long term debt, filling in blank values where needed
        stock_cat['Long Term Debt'] = np.where(balance_data[0].iloc[21, 1] == '-',
                                               balance_data[0].iloc[21, 2],
                                               balance_data[0].iloc[21, 1])
        
        #Pulls Yahoo's version of shares outstanding, applying a different multiplier for billions vs millions
        stock_cat['Approx Shares Outstanding'] = np.where(stat_data[8].loc[2,1][-1] == 'B',
                                                          float(stat_data[8].loc[2,1][:-1]) * 1000000000,
                                                          float(stat_data[8].loc[2,1][:-1]) * 1000000)
        
        #pulls Yahoo's compiling of analyst 5 year estimate for growth
        stock_cat['Conservative Analyst 5y'] = (float(analysis_data[5].loc[4, ticker.upper()].split('%')[0])/100)*(1-margin_of_safety)
        
        
        # turn columns to numeric
        num_cols = ['Cash Flow', 'Free Cash Flow', 'CapEx', 'Cash On Hand', 'Long Term Debt', 'Approx Shares Outstanding', 'Conservative Analyst 5y',
                    'Previous Close', 'Open', 'Volume', 'Avg. Volume', 'Beta (3Y Monthly)', 'PE Ratio (TTM)', 'EPS (TTM)', '1y Target Est']
        for col in num_cols:
            stock_cat[col] = np.nan_to_num(pd.to_numeric(stock_cat[col], errors='coerce'))
            
        # Yahoo reports these figures in thousands, so scale the needed cols to whole dollars with yahoo_adjust (1000 by default)
        dollar_cols = ['Cash Flow', 'CapEx', 'Free Cash Flow', 'Cash On Hand', 'Long Term Debt']
        for col in dollar_cols:
            stock_cat[col] = stock_cat[col] * yahoo_adjust
        
        #Calculate net present value of the equity            
        stock_cat['DCF Value'] = npv_fcf(stock_cat['Free Cash Flow'], 
                                         discount_rate, 
                                         years, 
                                         stock_cat['Conservative Analyst 5y'],
                                         multiplier,
                                         growth_decline,
                                         stock_cat['Cash On Hand'],
                                         stock_cat['Long Term Debt']
                                         )
        
        #Compare the DCF value per share to the share price to flag what's overvalued and undervalued. A multiple higher than 1 suggests undervalued
        stock_cat['DCF Share Price'] = stock_cat['DCF Value'] / stock_cat['Approx Shares Outstanding']
        stock_cat['DCF/Price Multiple'] = stock_cat['DCF Share Price']/stock_cat['Previous Close']
        
        #Give the period of reports using cash on hand report as proxy for data not reported
        stock_cat['Reporting Period'] = np.where(balance_data[0].iloc[2,1] == '-',
                                                 balance_data[0].iloc[0,2],
                                                 balance_data[0].iloc[0,1])
        stock_cat['Reporting Period'] = pd.to_datetime(stock_cat['Reporting Period'])
        
        #Preps needed categories for display
        stock_cat['Date'] = pd.to_datetime(datetime.date.today())
        stock_cat['Ticker'] = ticker.upper()
        stock_cat['Name'] = stocks[ticker][0]
        stock_cat['Categories'] = [stocks[ticker][1:]]

        #reorder the columns for more meaningful view
        stock_cat = stock_cat[list(stock_cat.columns[-4:]) + list(stock_cat.columns)[:-4]]
        
        #finally, add the entry to the list
        quotes.append(stock_cat.iloc[0])
    
    stock_frame = pd.DataFrame(quotes) #Turn these values into a dataframe
    stock_frame.replace('N/A (N/A)', np.nan, inplace=True) #properly indicate null values
    stock_frame.set_index('Date', drop=True, inplace=True) #and prepare datetime indexing for better analysis
    
    return stock_frame #return a dataframe reflecting each entry
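
Two of the cleanup steps inside `trackr` are easier to follow in isolation: promoting the label column to a header row, and expanding abbreviated figures such as shares outstanding. The sketch below uses a hand-built table with made-up values. `parse_abbreviated` is a hypothetical helper of my own, not something the function above defines.

```python
import pandas as pd

# Hand-built stand-in for one of the label/value tables that read_html returns
raw = pd.DataFrame([['Previous Close', '150.00'], ['Open', '151.25']])

wide = raw.transpose()       # labels now sit in row 0, values in row 1
wide.columns = wide.iloc[0]  # promote the label row to the header
wide = wide.drop(0)          # then drop it from the data
print(wide)

def parse_abbreviated(text):
    '''Hypothetical helper: turn strings such as "4.52B" or "310.5M" into floats'''
    scale = {'M': 1e6, 'B': 1e9}  # millions and billions only
    return float(text[:-1]) * scale[text[-1]]

print(parse_abbreviated('4.52B'))
```

Using a lookup table for the suffix means an unexpected one raises a `KeyError` instead of silently being treated as millions, which is a design choice you may or may not want.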

## Grabbing the Stocks and Presenting the Data

The dictionary below contains the stocks my client requested to track. Using a list to store the company name and multiple categories was simple enough to get the job done and keep him apprised of all the data he wanted on each company, as his goal is to sort and filter on his own criteria in a Google Sheet. 

In [5]:
#Dictionary of stocks requested to track. Key is ticker symbol, the first list item is the name, and subsequent items are categories to which the company belongs.

br_stocks = {'aapl': ['Apple', 'Cell Phone'], 
             '005930.KS': ['Samsung', 'Cell Phone'],
             't': ['ATT', 'Cell Service Provider', 'ISP'],
             'vz': ['Verizon', 'Cell Service Provider'],
             'cmcsa': ['Comcast', 'ISP'],
             'chtr': ['Charter', 'ISP'],
             'duk': ['Duke Energy', 'Power Provider'],
             'ngg': ['National Grid', 'Power Provider'],
             'so': ['Southern Company', 'Power Provider'],
             'googl': ['Google', 'Search'],
             'fb': ['Facebook', 'Advertising'],
             'twtr': ['Twitter', 'Twitter'],
             'gsk': ['Glaxosmithkline', 'Toothpaste'],
             'cl': ['Colgate-Palmolive', 'Toothpaste'],
             'ul': ['Unilever', 'Soap'],
             'ip': ['International Paper', 'Paper'],
             'RDS-A': ['Royal Dutch Shell', 'Gas'],
             'xom': ['Exxon', 'Gas'],
             'cvx': ['Chevron', 'Gas'],
             'pg': ['Procter & Gamble', 'Detergent'],
             'chd': ['Church and Dwight', 'Detergent'],
             'kdp': ['Keurig Dr Pepper', 'Coffee'],
             'sjm': ['JM Smucker Company', 'Coffee'],
             'sbux': ['Starbucks', 'Coffee'],
             'amzn': ['Amazon', 'Retail', 'Cloud Storage'],
             'wmt': ['Wal Mart', 'Retail', 'Grocery'],
             'tgt': ['Target', 'Retail'],
             'kr': ['Kroger', 'Kroger'],
             'msft': ['Microsoft', 'Cloud Storage'],
             'ibm': ['IBM', 'Cloud Storage'],
             'crm': ['Salesforce', 'Cloud Storage']}

Here, we call the function based on his requests.

In [6]:
dcf_trackr = trackr(br_stocks)

100%|██████████| 31/31 [01:49<00:00,  5.98s/it]


## Presenting to Google Sheets

The most important part is presenting the data in a way that's usable to the client. His preference was to receive the data in Google Sheets.

This called for reducing the DataFrame to the specific columns the client needed to see and giving him an easy way to make a decision based on the DCF. That column is the DCF/Price Multiple, which simply divides the DCF share price by the previous day's closing price. This is a "higher is better" scenario: any value greater than 1 means the DCF is telling us the company may be undervalued, and thus a potential buy.

In [6]:
quotes = dcf_trackr[['Ticker', 'Name', 'Categories', 'Previous Close', 'Market Cap', 'Cash Flow', 
                     'CapEx', 'Free Cash Flow', 'Cash On Hand', 'Long Term Debt', 'Approx Shares Outstanding', 
                     'Conservative Analyst 5y', 'DCF Value', 'DCF Share Price', 'DCF/Price Multiple', 'Reporting Period']]
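
The client does his sorting and filtering in the sheet itself, but the same "greater than 1" screen can be sketched in pandas. The tickers and prices below are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the trackr output
sample = pd.DataFrame({
    'Ticker': ['AAA', 'BBB', 'CCC'],
    'DCF Share Price': [120.0, 45.0, 80.0],
    'Previous Close': [100.0, 60.0, 80.0],
})
sample['DCF/Price Multiple'] = sample['DCF Share Price'] / sample['Previous Close']

# Keep only the rows the model flags as potentially undervalued, best first
undervalued = (sample[sample['DCF/Price Multiple'] > 1]
               .sort_values('DCF/Price Multiple', ascending=False))
print(undervalued[['Ticker', 'DCF/Price Multiple']])
```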

We then publish to Google Sheets using pygsheets in a process outlined [here](https://pygsheets.readthedocs.io/en/stable/).

In [8]:
#Publish to Google Sheets
sheet_auth = '' #path to your Google service account JSON key file
sheet_name = 'y_finance_stocks'
gc = pygsheets.authorize(service_file=sheet_auth)
sh = gc.open(sheet_name)
wks = sh[0]
wks.set_dataframe(quotes.reset_index(), (1,1)) #specifies cell coordinates of upper leftmost cell

The client is looking at the stocks that DCF indicates are undervalued and performing further analysis to make a decision on purchases, and with this script, you can too!

A few use cases that could be helpful:
- Copy the script and input stocks that you would like to track
- Use jupyter nbconvert and argparse to save this as a script you can run from the command line
- Automate the operation in a way that you see fit
- Find more models and hunt for more data on the web, even breaking out of what's on Yahoo Finance!
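
For the command-line idea, `jupyter nbconvert --to script your_notebook.ipynb` exports the notebook as a `.py` file (the notebook filename is a placeholder). From there, a minimal argparse sketch might look like the following. The argument names are my own assumptions, with defaults copied from `trackr`.

```python
import argparse

def build_parser():
    '''Hypothetical command-line interface for a script exported from the notebook'''
    parser = argparse.ArgumentParser(description='Run the DCF tracker from the command line')
    parser.add_argument('tickers', nargs='+', help='ticker symbols to evaluate')
    parser.add_argument('--margin-of-safety', type=float, default=0.25)
    parser.add_argument('--discount-rate', type=float, default=0.1)
    parser.add_argument('--years', type=int, default=10)
    return parser

# Parsing an explicit list here; a real script would call parse_args() with no arguments
args = build_parser().parse_args(['aapl', 'msft', '--years', '5'])
print(args.tickers, args.margin_of_safety, args.discount_rate, args.years)
```

The parsed values could then be passed to `trackr` once the tickers are mapped back to the dictionary format it expects.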