# Beautiful Soup and Yahoo! Finance
###### Mike Magruder, 2020/02/19

As of early 2020, Yahoo! Finance is the best site to scrape for stock data. That may change one day, but it works for now, and lots of people and tools rely on it, so it's a safe bet for a school project.

The URL for a stock quote page is very stable. For example, Apple's quote page is https://finance.yahoo.com/quote/AAPL?p=AAPL
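Since the URL follows a fixed pattern, it can be built for any ticker. Here is a minimal sketch, assuming every symbol follows the same pattern as the AAPL example; `quote_url` is a hypothetical helper name, not something Yahoo! or Beautiful Soup provides.

```python
from urllib.parse import quote


def quote_url(symbol):
    """Build a quote-page URL for a ticker, following the AAPL pattern (an assumption for other symbols)."""
    # Normalize the symbol and percent-encode anything unusual, such as '^' in index symbols.
    safe_symbol = quote(symbol.strip().upper(), safe='')
    return f'https://finance.yahoo.com/quote/{safe_symbol}?p={safe_symbol}'


quote_url('AAPL')
```

The result can be passed straight to `requests.get()`.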

The quote page has two tables that are really, really useful and really, really stable. They're easy to spot when looking at the page: the first has "Previous Close" as its first row, and the second starts with "Market Cap."

You can use Beautiful Soup to find all tables, then pick the tables apart and build up something useful in Python. This notebook shows the process.

In [2]:
import requests
from bs4 import BeautifulSoup

In [2]:
# Request a quote page for AAPL and check the response code. It should be 200, meaning success.
r = requests.get('https://finance.yahoo.com/quote/AAPL?p=AAPL')
r.status_code

200

In [3]:
r.text[:100]  # Let's see a little bit of the result

'<!DOCTYPE html><html id="atomic" class="NoJs featurephone" lang="en-US"><head prefix="og: http://ogp'

In [4]:
soup = BeautifulSoup(r.text, 'html.parser')  # Make some soup...
soup.title  # Did it kinda work? 

<title>Apple Inc. (AAPL) Stock Price, Quote, History &amp; News - Yahoo Finance</title>

In [5]:
tables = soup.find_all('table')  # Find those tables... there should be two.
len(tables)

2

In [6]:
t1 = tables[0]
# If we have a table, we can find the rows within it. NB: we're calling find_all() on the first table, not the whole soup.
rows = t1.find_all('tr')
rows[0]

<tr class="Bxz(bb) Bdbw(1px) Bdbs(s) Bdc($seperatorColor) H(36px)" data-reactid="12"><td class="C($primaryColor) W(51%)" data-reactid="13"><span data-reactid="14">Previous Close</span></td><td class="Ta(end) Fw(600) Lh(14px)" data-reactid="15" data-test="PREV_CLOSE-value"><span class="Trsdu(0.3s)" data-reactid="16">319.00</span></td></tr>

In [7]:
cols = rows[0].find_all('td')  # And of course, rows have columns...
cols[0]

<td class="C($primaryColor) W(51%)" data-reactid="13"><span data-reactid="14">Previous Close</span></td>

## Parsing all quote data
So, given that we can find the tables, and the rows and columns within them, let's turn them into a dictionary keyed by the description of each value in the table. This will make it really easy to look up whatever you want later.

NB: every row in each of these tables has just two columns: the name of a value, and the value. We use the name as the key in the dictionary.

In [8]:
stock_quote = {}

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        cols = row.find_all('td')
        if len(cols) < 2:
            continue  # Skip any row that isn't a name/value pair.
        stock_quote[cols[0].text.strip()] = cols[1].text.strip()

stock_quote

{'Previous Close': '319.00',
 'Open': '320.00',
 'Bid': '323.64 x 800',
 'Ask': '323.39 x 1300',
 "Day's Range": '320.00 - 324.54',
 '52 Week Range': '169.50 - 327.85',
 'Volume': '22,677,483',
 'Avg. Volume': '29,879,924',
 'Market Cap': '1.416T',
 'Beta (5Y Monthly)': '1.28',
 'PE Ratio (TTM)': '25.69',
 'EPS (TTM)': '12.60',
 'Earnings Date': 'Apr 27, 2020 - May 03, 2020',
 'Forward Dividend & Yield': '3.08 (0.95%)',
 'Ex-Dividend Date': 'Feb 06, 2020',
 '1y Target Est': '333.31'}

In [9]:
stock_quote['Previous Close']

'319.00'
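Every value in the dictionary is a string, so it needs converting before you can do arithmetic with it. Here is a rough sketch of one way to do that. `to_number` is a hypothetical helper. Only the `T` suffix appears in the data above, so the `K`, `M`, and `B` suffixes are assumptions about how other values might be abbreviated.

```python
def to_number(text):
    """Convert strings like '22,677,483' or '1.416T' to a float; return None if that isn't possible."""
    # Assumed magnitude suffixes; only 'T' is seen in the scraped data above.
    suffixes = {'K': 1e3, 'M': 1e6, 'B': 1e9, 'T': 1e12}
    cleaned = text.replace(',', '').strip()
    multiplier = 1.0
    if cleaned and cleaned[-1] in suffixes:
        multiplier = suffixes[cleaned[-1]]
        cleaned = cleaned[:-1]
    try:
        return float(cleaned) * multiplier
    except ValueError:
        return None  # Ranges, dates, and other non-numeric strings end up here.


to_number('22,677,483')
```

Values such as `'320.00 - 324.54'` or `'Feb 06, 2020'` come back as `None`, so ranges and dates still need their own handling.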

## Parsing the current price
Finding the live, current price is harder. There isn't a very good marker in the HTML to go by. Using the inspector in Chrome or Firefox, you can see that the current price is in a span with a very weird class attribute and a data-reactid attribute. I recommend against relying on the data-reactid attribute, as I suspect it could change easily. The class attribute is typically more stable, though honestly in this case it seems quite brittle and likely to change as well. Still, we'll go with the class attribute.

An alternative to grabbing the actual current price is to use the Previous Close value above, which gives the last price from the previous trading day, or the Ask price, which will be reasonably close to the current price and ought to be good enough even for minute-level data.
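That fallback could look something like the sketch below. It assumes the Ask string always has the `price x size` shape seen in the dictionary above. `price_from_quote` is a hypothetical helper, and `sample` is a small stand-in for `stock_quote`.

```python
def price_from_quote(quote, keys=('Ask', 'Previous Close')):
    """Return the first usable price from a dict of scraped strings as a float, or None."""
    for key in keys:
        raw = quote.get(key, '')
        # 'Ask' looks like '323.39 x 1300' (price x size); keep only the price part.
        price_text = raw.split(' x ')[0].replace(',', '').strip()
        try:
            return float(price_text)
        except ValueError:
            continue  # Missing or non-numeric value; try the next key.
    return None


sample = {'Previous Close': '319.00', 'Ask': '323.39 x 1300'}
price_from_quote(sample)
```

The order of `keys` decides which value wins, so put the one you trust most first.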

In [10]:
# Let's look at the first 15 spans. Indeed, we can see the current price in there, along with its class attribute.
soup.find_all('span')[:15]

[<span data-reactid="31">No matching results for ''</span>,
 <span data-reactid="33">Tip: Try a valid symbol or a specific company name for relevant results</span>,
 <span data-reactid="36">Cancel</span>,
 <span data-reactid="9">Summary</span>,
 <span data-reactid="13">Statistics</span>,
 <span data-reactid="17">Historical Data</span>,
 <span data-reactid="21">Profile</span>,
 <span data-reactid="25">Financials</span>,
 <span data-reactid="29">Analysis</span>,
 <span data-reactid="33">Options</span>,
 <span data-reactid="37">Holders</span>,
 <span data-reactid="41">Sustainability</span>,
 <span data-reactid="9">NasdaqGS - NasdaqGS Real Time Price. Currency in USD</span>,
 <span class="Trsdu(0.3s) Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(b)" data-reactid="14">323.62</span>,
 <span class="Trsdu(0.3s) Fw(500) Fz(14px) C($dataGreen)" data-reactid="16">+4.62 (+1.45%)</span>]

In [11]:
current_value = soup.find('span', attrs={'class': 'Trsdu(0.3s) Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(b)'}).text.strip()
current_value

'323.62'
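Matching the full class string breaks as soon as any one class in it changes. A slightly looser option, sketched below, is to match on a single class, since Beautiful Soup's `class_` argument matches a tag if any one of its classes equals the given string. The choice of `Fz(36px)` is an assumption: it is the only class that distinguishes the price span among the spans listed above, but nothing guarantees it stays unique or stable. `find_price_text` is a hypothetical helper, and the HTML is a trimmed stand-in for the real page.

```python
from bs4 import BeautifulSoup

# A trimmed-down stand-in for the quote page, using the two spans from the listing above.
html = (
    '<span class="Trsdu(0.3s) Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(b)" data-reactid="14">323.62</span>'
    '<span class="Trsdu(0.3s) Fw(500) Fz(14px) C($dataGreen)" data-reactid="16">+4.62 (+1.45%)</span>'
)
sample_soup = BeautifulSoup(html, 'html.parser')


def find_price_text(soup, price_class='Fz(36px)'):
    """Return the text of the first span carrying price_class, or None if there is no such span."""
    span = soup.find('span', class_=price_class)
    return span.text.strip() if span is not None else None


find_price_text(sample_soup)
```

Returning `None` instead of raising an error makes it easy to spot when the page layout has changed.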