# Yahoo! Finance Scraper
Extract historical stock prices using a hidden Yahoo! Finance API

In [95]:
import re
import json
import csv
import requests
from bs4 import BeautifulSoup
from io import StringIO

Pages that we want to scrape

In [2]:
url_stats = 'https://finance.yahoo.com/quote/{}/key-statistics?p={}'
url_profile = 'https://finance.yahoo.com/quote/{}/profile?p={}'
url_financials = 'https://finance.yahoo.com/quote/{}/financials?p={}'
url_analysis = 'https://finance.yahoo.com/quote/{}/analysis?p={}'

Target company

In [3]:
stock = 'F'

## Financial Statements

Request the data and parse the HTML

In [4]:
response = requests.get(url_financials.format(stock, stock))
soup = BeautifulSoup(response.text, 'html.parser')

- The JSON data is embedded in the page's JavaScript; the script that holds it contains the comment "-- Data --".
- Use a regular expression to find the script tag whose text matches this pattern.
- Assign the contents of that script tag to a `script_data` variable. `.contents` returns a list, so grab the first item in the list.

In [120]:
pattern = re.compile(r'\s--\sData\s--\s')
script_data = soup.find('script', text=pattern).contents[0]

The contents contain embedded JSON data, but we need to strip the JavaScript function wrapper from the front and back of the JSON string.
- Find the word `context` and back up two characters to land on the opening brace.
- The JSON string ends 12 characters from the end of the contents.
- Use these two offsets to slice the JSON string out of the contents, then parse it into a Python dictionary using the `json.loads` function.

In [6]:
# the beginning
script_data[:500]

'\n(function (root) {\n/* -- Data -- */\nroot.App || (root.App = {});\nroot.App.now = 1600023091164;\nroot.App.main = {"context":{"dispatcher":{"stores":{"PageStore":{"currentPageName":"quote","currentRenderTargetId":"default","pagesConfigRaw":{"base":{"quote":{"layout":{"bundleName":"yahoodotcom-layout.TwoColumnLayout","name":"TwoColumnLayout","config":{"enableHeaderCollapse":true,"additionalBodyWrapperClasses":"Bgc($layoutBgColor)!","contentWrapperClasses":"Bgc($lv2BgColor)!","Header":{"isFixed":tru'

In [7]:
# the end
script_data[-500:]

'how":{"strings":1},"tdv2-applet-sponsored-moments":{"strings":1},"tdv2-applet-stream":{"strings":1},"tdv2-applet-stream-hero":{"strings":1},"tdv2-applet-swisschamp":{"strings":1},"tdv2-applet-uh":{"strings":1},"tdv2-applet-userintent":{"strings":1},"tdv2-applet-video-lightbox":{"strings":1},"tdv2-applet-video-modal":{"strings":1},"tdv2-wafer-adfeedback":{"strings":1},"tdv2-wafer-header":{"strings":1},"yahoodotcom-layout":{"strings":1}}},"options":{"defaultBundle":"td-app-finance"}}}};\n}(this));\n'

In [8]:
start = script_data.index('context')-2
json_data = json.loads(script_data[start:-12])
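The fixed 12-character tail works for the page as captured here, but it would break if the wrapper ever changed length. As a hedged alternative, here is a minimal sketch that uses `json.JSONDecoder().raw_decode`, which parses one JSON value from a given position and ignores whatever follows it. The helper name `extract_embedded_json` and the tiny sample string are made up for illustration; the `root.App.main = ` marker is taken from the script output shown above.

```python
import json

def extract_embedded_json(script_text, marker='root.App.main = '):
    """Parse the JSON object that directly follows `marker` in a script's text."""
    start = script_text.index(marker) + len(marker)
    # raw_decode parses a single JSON value starting at `start` and returns
    # (object, end_index); trailing text such as ";\n}(this));\n" is ignored.
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj

# Hypothetical, heavily shortened stand-in for the real script contents.
sample = (
    '\n(function (root) {\n/* -- Data -- */\n'
    'root.App.main = {"context":{"options":{"a":1}}};\n}(this));\n'
)
data = extract_embedded_json(sample)
print(data['context']['options'])
```

This trades the two magic offsets for a dependency on the marker string, so it is still tied to the page layout; it just fails more loudly (a `ValueError`) when the marker is missing.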

Now that you have the JSON data, you can begin exploring the dictionary by listing its keys and then finding out what's inside each one.

In [9]:
json_data['context'].keys()

dict_keys(['dispatcher', 'options', 'plugins'])

You can find the financial statements in the *QuoteSummaryStore*

In [10]:
json_data['context']['dispatcher']['stores']['QuoteSummaryStore'].keys()

dict_keys(['financialsTemplate', 'cashflowStatementHistory', 'balanceSheetHistoryQuarterly', 'earnings', 'price', 'incomeStatementHistoryQuarterly', 'incomeStatementHistory', 'balanceSheetHistory', 'cashflowStatementHistoryQuarterly', 'quoteType', 'summaryDetail', 'symbol', 'pageViews'])

This data set contains both quarterly and annual financial statement information for:
- Income Statement
- Balance Sheet
- Statement of Cashflow

In [11]:
annual_is = json_data['context']['dispatcher']['stores']['QuoteSummaryStore']['incomeStatementHistory']['incomeStatementHistory']
quarterly_is = json_data['context']['dispatcher']['stores']['QuoteSummaryStore']['incomeStatementHistoryQuarterly']['incomeStatementHistory']

annual_cf = json_data['context']['dispatcher']['stores']['QuoteSummaryStore']['cashflowStatementHistory']['cashflowStatements']
quarterly_cf = json_data['context']['dispatcher']['stores']['QuoteSummaryStore']['cashflowStatementHistoryQuarterly']['cashflowStatements']

annual_bs = json_data['context']['dispatcher']['stores']['QuoteSummaryStore']['balanceSheetHistory']['balanceSheetStatements']
quarterly_bs = json_data['context']['dispatcher']['stores']['QuoteSummaryStore']['balanceSheetHistoryQuarterly']['balanceSheetStatements']
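The long key path is repeated on every line above. One way to shorten it, sketched here under the assumption that the nesting stays as shown, is a small helper that walks a list of keys. The name `dig` and the miniature `sample` dictionary are hypothetical and only mirror the structure of the real data.

```python
from functools import reduce

def dig(data, *keys):
    """Follow a sequence of keys into nested dictionaries."""
    return reduce(lambda node, key: node[key], keys, data)

# Hypothetical miniature of the structure shown above.
sample = {'context': {'dispatcher': {'stores': {'QuoteSummaryStore': {
    'incomeStatementHistory': {
        'incomeStatementHistory': [{'netIncome': {'raw': 47000000}}],
    },
}}}}}

# Resolve the shared prefix once, then index from there.
store = dig(sample, 'context', 'dispatcher', 'stores', 'QuoteSummaryStore')
annual = store['incomeStatementHistory']['incomeStatementHistory']
print(annual[0]['netIncome']['raw'])
```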

In [12]:
# income statement example
print(annual_is[0])

{'researchDevelopment': {}, 'effectOfAccountingCharges': {}, 'incomeBeforeTax': {'raw': -640000000, 'fmt': '-640M', 'longFmt': '-640,000,000'}, 'minorityInterest': {'raw': 45000000, 'fmt': '45M', 'longFmt': '45,000,000'}, 'netIncome': {'raw': 47000000, 'fmt': '47M', 'longFmt': '47,000,000'}, 'sellingGeneralAdministrative': {'raw': 10218000000, 'fmt': '10.22B', 'longFmt': '10,218,000,000'}, 'grossProfit': {'raw': 12876000000, 'fmt': '12.88B', 'longFmt': '12,876,000,000'}, 'ebit': {'raw': 2658000000, 'fmt': '2.66B', 'longFmt': '2,658,000,000'}, 'endDate': {'raw': 1577750400, 'fmt': '2019-12-31'}, 'operatingIncome': {'raw': 2658000000, 'fmt': '2.66B', 'longFmt': '2,658,000,000'}, 'otherOperatingExpenses': {}, 'interestExpense': {'raw': -1049000000, 'fmt': '-1.05B', 'longFmt': '-1,049,000,000'}, 'extraordinaryItems': {}, 'nonRecurring': {}, 'otherItems': {}, 'incomeTaxExpense': {'raw': -724000000, 'fmt': '-724M', 'longFmt': '-724,000,000'}, 'totalRevenue': {'raw': 155900000000, 'fmt': '155

In [13]:
print(annual_is[0]['operatingIncome'])

{'raw': 2658000000, 'fmt': '2.66B', 'longFmt': '2,658,000,000'}


Now that you have access to the data, you can extract the pieces that you want using a for loop. For example, I want to grab the **account** name and the **raw** number for every annual income statement. Most of these accounts hold a dictionary that contains the `raw` attribute; however, some do not. An empty dictionary raises a `KeyError` when you ask for `'raw'`, and a value that is not a dictionary at all (a plain number, for instance) raises a `TypeError`. So, I'll have to include some error handling for both cases.

In [14]:
annual_is_stmts = []

# consolidate annual
for s in annual_is:
    statement = {}
    for key, val in s.items():
        try:
            statement[key] = val['raw']
        except TypeError:
            continue
        except KeyError:
            continue
    annual_is_stmts.append(statement)

In [15]:
annual_is_stmts[0]

{'incomeBeforeTax': -640000000,
 'minorityInterest': 45000000,
 'netIncome': 47000000,
 'sellingGeneralAdministrative': 10218000000,
 'grossProfit': 12876000000,
 'ebit': 2658000000,
 'endDate': 1577750400,
 'operatingIncome': 2658000000,
 'interestExpense': -1049000000,
 'incomeTaxExpense': -724000000,
 'totalRevenue': 155900000000,
 'totalOperatingExpenses': 153242000000,
 'costOfRevenue': 143024000000,
 'totalOtherIncomeExpenseNet': -3298000000,
 'netIncomeFromContinuingOps': 84000000,
 'netIncomeApplicableToCommonShares': 47000000}
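The `endDate` value is left as a raw number. Judging by its `fmt` counterpart in the earlier output (`'2019-12-31'`), it appears to be a Unix timestamp in seconds. A minimal sketch of converting it back to a date, assuming that interpretation holds:

```python
from datetime import datetime, timezone

# 1577750400 is the raw endDate from the statement above.
end_date = datetime.fromtimestamp(1577750400, tz=timezone.utc).date()
print(end_date.isoformat())  # 2019-12-31
```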

All of the financial statements can be scraped in this exact way. I can literally just copy and paste this code, change the variable names from "is" to "cf" (or "bs"), and then make sure I'm passing in the correct JSON data.

In [16]:
annual_cf_stmts = []
quarterly_cf_stmts = []

# annual
for s in annual_cf:
    statement = {}
    for key, val in s.items():
        try:
            statement[key] = val['raw']
        except TypeError:
            continue
        except KeyError:
            continue
    annual_cf_stmts.append(statement)
    
# quarterly
for s in quarterly_cf:
    statement = {}
    for key, val in s.items():
        try:
            statement[key] = val['raw']
        except TypeError:
            continue
        except KeyError:
            continue
    quarterly_cf_stmts.append(statement)

In [17]:
annual_cf_stmts[0]

{'investments': -543000000,
 'changeToLiabilities': 5260000000,
 'totalCashflowsFromInvestingActivities': -13721000000,
 'netBorrowings': -277000000,
 'totalCashFromFinancingActivities': -3129000000,
 'changeToOperatingActivities': 1554000000,
 'netIncome': 47000000,
 'changeInCash': 834000000,
 'endDate': 1577750400,
 'repurchaseOfStock': -237000000,
 'effectOfExchangeRate': 45000000,
 'totalCashFromOperatingActivities': 17639000000,
 'depreciation': 8490000000,
 'otherCashflowsFromInvestingActivities': -152000000,
 'dividendsPaid': -2389000000,
 'changeToInventory': 206000000,
 'changeToAccountReceivables': -816000000,
 'otherCashflowsFromFinancingActivities': -226000000,
 'changeToNetincome': 2898000000,
 'capitalExpenditures': -7632000000}
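Since the same loop is repeated for every statement type, it could also be folded into one function. This is only a sketch: the name `extract_raw` is hypothetical, and it uses an `isinstance` check in place of the `try`/`except` above, which should skip the same entries (empty dictionaries and non-dictionary values).

```python
def extract_raw(statements):
    """Return one {account: raw_value} dict per statement, skipping entries without 'raw'."""
    result = []
    for s in statements:
        row = {}
        for key, val in s.items():
            # Keep only entries shaped like {'raw': ..., 'fmt': ...}.
            if isinstance(val, dict) and 'raw' in val:
                row[key] = val['raw']
        result.append(row)
    return result

# Hypothetical sample mirroring the shapes seen in the real data.
sample = [{
    'netIncome': {'raw': 47000000, 'fmt': '47M'},
    'researchDevelopment': {},
    'maxAge': 1,
}]
print(extract_raw(sample))
```

With a helper like this, each statement list (for example `annual_bs` or `quarterly_is`) would need only a single call.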

## Profile
The company profile information can be scraped in exactly the same way, so I'm literally going to copy and paste the code from above. The only thing I need to do is change the url from the Financials to the Profile.

In [20]:
response = requests.get(url_profile.format(stock, stock))
soup = BeautifulSoup(response.text, 'html.parser')
pattern = re.compile(r'\s--\sData\s--\s')
script_data = soup.find('script', text=pattern).contents[0]
start = script_data.index('context')-2
json_data = json.loads(script_data[start:-12])

Similar to before, most of the interesting data is located in the **Quote Summary Store**

In [21]:
json_data['context']['dispatcher']['stores']['QuoteSummaryStore'].keys()

dict_keys(['financialsTemplate', 'price', 'secFilings', 'quoteType', 'calendarEvents', 'summaryDetail', 'symbol', 'assetProfile', 'pageViews'])

Most of the information you'll want from this is in the `assetProfile` key, but the `summaryDetail` and `secFilings` are also worth checking out.

In [84]:
json_data['context']['dispatcher']['stores']['QuoteSummaryStore']['assetProfile'].keys()

dict_keys(['zip', 'sector', 'fullTimeEmployees', 'compensationRisk', 'auditRisk', 'longBusinessSummary', 'city', 'phone', 'state', 'shareHolderRightsRisk', 'compensationAsOfEpochDate', 'governanceEpochDate', 'boardRisk', 'country', 'companyOfficers', 'website', 'maxAge', 'overallRisk', 'address1', 'industry'])

**Company Officers**  
The big deal on this page is the list of company officers; you can get their names, titles, compensation, etc... This is within `assetProfile` and then `companyOfficers`.

In [24]:
json_data['context']['dispatcher']['stores']['QuoteSummaryStore']['assetProfile']['companyOfficers']

[{'totalPay': {'raw': 3661316, 'fmt': '3.66M', 'longFmt': '3,661,316'},
  'exercisedValue': {'raw': 0, 'fmt': None, 'longFmt': '0'},
  'yearBorn': 1957,
  'name': 'Mr. William Clay Ford Jr.',
  'title': 'Exec. Chairman',
  'maxAge': 1,
  'fiscalYear': 2019,
  'unexercisedValue': {'raw': 0, 'fmt': None, 'longFmt': '0'},
  'age': 62},
 {'totalPay': {'raw': 4167237, 'fmt': '4.17M', 'longFmt': '4,167,237'},
  'exercisedValue': {'raw': 0, 'fmt': None, 'longFmt': '0'},
  'yearBorn': 1955,
  'name': 'Mr. James Patrick Hackett',
  'title': 'Pres, CEO & Director',
  'maxAge': 1,
  'fiscalYear': 2019,
  'unexercisedValue': {'raw': 0, 'fmt': None, 'longFmt': '0'},
  'age': 64},
 {'totalPay': {'raw': 4018261, 'fmt': '4.02M', 'longFmt': '4,018,261'},
  'exercisedValue': {'raw': 0, 'fmt': None, 'longFmt': '0'},
  'yearBorn': 1967,
  'name': 'Mr. Timothy R. Stone',
  'title': 'Chief Financial Officer',
  'maxAge': 1,
  'fiscalYear': 2019,
  'unexercisedValue': {'raw': 0, 'fmt': None, 'longFmt': '0'},
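To turn that list into flat rows, something like the following sketch could work. The two sample entries are copied from the output above; the choice of fields and the use of `.get` (in case an officer has no `totalPay` entry) are assumptions, not something the data guarantees.

```python
# Two officers copied from the output above, trimmed to the fields used here.
officers = [
    {'name': 'Mr. William Clay Ford Jr.', 'title': 'Exec. Chairman',
     'totalPay': {'raw': 3661316, 'fmt': '3.66M', 'longFmt': '3,661,316'}},
    {'name': 'Mr. James Patrick Hackett', 'title': 'Pres, CEO & Director',
     'totalPay': {'raw': 4167237, 'fmt': '4.17M', 'longFmt': '4,167,237'}},
]

rows = [
    {
        'name': o['name'],
        'title': o['title'],
        # Assumption: totalPay may be missing, so fall back to None.
        'totalPay': o.get('totalPay', {}).get('raw'),
    }
    for o in officers
]

for row in rows:
    print(row)
```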

**Company Description**  
Also located in the `assetProfile` is a description of the company.

In [90]:
json_data['context']['dispatcher']['stores']['QuoteSummaryStore']['assetProfile']['longBusinessSummary']

'Ford Motor Company designs, manufactures, markets, and services a range of Ford cars, trucks, sport utility vehicles, electrified vehicles, and Lincoln luxury vehicles worldwide. It operates through three segments: Automotive, Mobility, and Ford Credit. The Automotive segment sells Ford and Lincoln vehicles, service parts, and accessories through distributors and dealers, as well as through dealerships to commercial fleet customers, daily rental car companies, and governments. The Mobility segment designs and builds mobility services; and provides self-driving systems development and vehicle integration, autonomous vehicle research and engineering, and autonomous vehicle transportation-as-a-service network development services. The Ford Credit segment primarily engages in vehicle-related financing and leasing activities to and through automotive dealers. It provides retail installment sale contracts for new and used vehicles; and direct financing leases for new vehicles to retail and 

**SEC Filings**  
You can get a list of SEC filings. Each filing's `edgarUrl` takes you directly to the EDGAR Online site, where you can get more information. However, you'll need to replace any `&amp;` in the URL with a plain `&` if you want to click on it.

In [23]:
json_data['context']['dispatcher']['stores']['QuoteSummaryStore']['secFilings']['filings'][10]

{'date': '2020-04-13',
 'epochDate': 1586788296,
 'type': '8-K',
 'title': 'Disclosing Regulation FD Disclosure, Other Events, Financial Statements and Exhibits',
 'edgarUrl': 'https://yahoo.brand.edgar-online.com/DisplayFiling.aspx?TabIndex=2&dcn=0000037996-20-000031&nav=1&src=Yahoo',
 'maxAge': 1}
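If a URL does come through with `&amp;` in it, the standard library's `html.unescape` can clean it up. A minimal sketch, where the escaped input string is a hypothetical variant of the `edgarUrl` shown above:

```python
from html import unescape

# Hypothetical escaped form of the edgarUrl from the output above.
escaped = ('https://yahoo.brand.edgar-online.com/DisplayFiling.aspx'
           '?TabIndex=2&amp;dcn=0000037996-20-000031&amp;nav=1&amp;src=Yahoo')

clean = unescape(escaped)
print(clean)
```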

There's tons of summary information in the `summaryDetail` section. I'll let you explore that on your own, but you can see clearly what is in there.

In [82]:
json_data['context']['dispatcher']['stores']['QuoteSummaryStore']['summaryDetail'].keys()

dict_keys(['previousClose', 'regularMarketOpen', 'twoHundredDayAverage', 'trailingAnnualDividendYield', 'payoutRatio', 'volume24Hr', 'regularMarketDayHigh', 'navPrice', 'averageDailyVolume10Day', 'totalAssets', 'regularMarketPreviousClose', 'fiftyDayAverage', 'trailingAnnualDividendRate', 'open', 'toCurrency', 'averageVolume10days', 'expireDate', 'yield', 'algorithm', 'dividendRate', 'exDividendDate', 'beta', 'circulatingSupply', 'startDate', 'regularMarketDayLow', 'priceHint', 'currency', 'regularMarketVolume', 'lastMarket', 'maxSupply', 'openInterest', 'marketCap', 'volumeAllCurrencies', 'strikePrice', 'averageVolume', 'priceToSalesTrailing12Months', 'dayLow', 'ask', 'ytdReturn', 'askSize', 'volume', 'fiftyTwoWeekHigh', 'forwardPE', 'maxAge', 'fromCurrency', 'fiveYearAvgDividendYield', 'fiftyTwoWeekLow', 'bid', 'tradeable', 'dividendYield', 'bidSize', 'dayHigh'])

## Statistics  

Now that you've seen that this is basically the same process, I'll quickly show you that the same applies to the Statistics page as well. I'll copy and paste the code we used for the profile data.

In [26]:
response = requests.get(url_stats.format(stock, stock))
soup = BeautifulSoup(response.text, 'html.parser')
pattern = re.compile(r'\s--\sData\s--\s')
script_data = soup.find('script', text=pattern).contents[0]
start = script_data.index('context')-2
json_data = json.loads(script_data[start:-12])

Similar to before, the interesting information is in the `QuoteSummaryStore`; specifically in the key called `defaultKeyStatistics`. There are a ton of data points here for you to scrape, so now that you know how to get the data, you can extract whatever you find is most useful to you.

In [27]:
json_data['context']['dispatcher']['stores']['QuoteSummaryStore']['defaultKeyStatistics'].keys()

dict_keys(['annualHoldingsTurnover', 'enterpriseToRevenue', 'beta3Year', 'profitMargins', 'enterpriseToEbitda', '52WeekChange', 'morningStarRiskRating', 'forwardEps', 'revenueQuarterlyGrowth', 'sharesOutstanding', 'fundInceptionDate', 'annualReportExpenseRatio', 'totalAssets', 'bookValue', 'sharesShort', 'sharesPercentSharesOut', 'fundFamily', 'lastFiscalYearEnd', 'heldPercentInstitutions', 'netIncomeToCommon', 'trailingEps', 'lastDividendValue', 'SandP52WeekChange', 'priceToBook', 'heldPercentInsiders', 'nextFiscalYearEnd', 'yield', 'mostRecentQuarter', 'shortRatio', 'sharesShortPreviousMonthDate', 'floatShares', 'beta', 'enterpriseValue', 'priceHint', 'threeYearAverageReturn', 'lastSplitDate', 'lastSplitFactor', 'legalType', 'morningStarOverallRating', 'earningsQuarterlyGrowth', 'priceToSalesTrailing12Months', 'dateShortInterest', 'pegRatio', 'ytdReturn', 'forwardPE', 'maxAge', 'lastCapGain', 'shortPercentOfFloat', 'sharesShortPriorMonth', 'category', 'fiveYearAverageReturn'])

In [28]:
json_data['context']['dispatcher']['stores']['QuoteSummaryStore']['defaultKeyStatistics']

{'annualHoldingsTurnover': {},
 'enterpriseToRevenue': {'raw': 1.266, 'fmt': '1.27'},
 'beta3Year': {},
 'profitMargins': {'raw': -0.016280001, 'fmt': '-1.63%'},
 'enterpriseToEbitda': {'raw': 26.996, 'fmt': '27.00'},
 '52WeekChange': {'raw': -0.24731183, 'fmt': '-24.73%'},
 'morningStarRiskRating': {},
 'forwardEps': {'raw': 0.71, 'fmt': '0.71'},
 'revenueQuarterlyGrowth': {},
 'sharesOutstanding': {'raw': 3907539968,
  'fmt': '3.91B',
  'longFmt': '3,907,539,968'},
 'fundInceptionDate': {},
 'annualReportExpenseRatio': {},
 'totalAssets': {},
 'bookValue': {'raw': 7.748, 'fmt': '7.75'},
 'sharesShort': {'raw': 98362703, 'fmt': '98.36M', 'longFmt': '98,362,703'},
 'sharesPercentSharesOut': {'raw': 0.0247, 'fmt': '2.47%'},
 'fundFamily': None,
 'lastFiscalYearEnd': {'raw': 1577750400, 'fmt': '2019-12-31'},
 'heldPercentInstitutions': {'raw': 0.55063, 'fmt': '55.06%'},
 'netIncomeToCommon': {'raw': -2123000064,
  'fmt': '-2.12B',
  'longFmt': '-2,123,000,064'},
 'trailingEps': {'raw': -

## Historical Stock Data
Finally, let's take a look at how to extract the historical stock data. There's actually a hidden API for this, and originally I was going to show you how to access the JSON data similar to what we've done already. However, while working on this script, I realized there was a download button on the page, which still accesses a hidden API but returns CSV data instead of JSON. And since the goal was to get this data into a structured CSV format anyway, I figured: what's the point of extracting it via JSON when I can get exactly what I want with a single API call? So, I'm going to first show you the call, and then I'll show you how I got there.

In [36]:
stock_url = 'https://query1.finance.yahoo.com/v7/finance/download/F?period1=1568402418&period2=1600024818&interval=1d&events=history'

In [37]:
response = requests.get(stock_url)

In [38]:
with open('stock_data.csv', 'w', newline='', encoding='utf-8') as f:
    f.write(response.text)

Now, it's pretty easy to see this is an API call: the host name starts with "query", the path carries an API version number (v7), and the URL ends with a string of query parameters. We can simplify this call and make it more customizable by moving the parameters into a dictionary and adding a `{}` placeholder for the stock ticker.

In [39]:
stock_url = 'https://query1.finance.yahoo.com/v7/finance/download/{}'

In [40]:
params = {
    'period1':'1568402418',
    'period2':'1600024818',
    'interval': '1d',
    'events': 'history'
}

Now, we can customize this more easily if we want to change any parameters. It's still not especially useful because the period parameters are using timestamps that are not really readable by humans. Fortunately, we can replace those periods with a range parameter. 

In [41]:
params = {
    'range': '5y',
    'interval': '1d',
    'events': 'history'
}

In [42]:
response = requests.get(stock_url.format('F'), params=params)

This time, let's wrap the response text in a file-like object and parse it with `csv.reader` so we can print the first few rows on the screen

In [101]:
file = StringIO(response.text)
reader = csv.reader(file)
data = list(reader)

for row in data[:5]:
    print(row)

['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
['2015-09-14', '13.720000', '13.790000', '13.630000', '13.780000', '10.789936', '26093500']
['2015-09-15', '13.800000', '14.370000', '13.790000', '14.310000', '11.204935', '46666700']
['2015-09-16', '14.320000', '14.760000', '14.250000', '14.640000', '11.463331', '41675000']
['2015-09-17', '14.610000', '14.880000', '14.460000', '14.600000', '11.432008', '37709000']
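When you need an exact date window instead of a `range`, the `period1` and `period2` parameters from the earlier call can be built from calendar dates. This sketch assumes they are Unix timestamps in seconds (UTC), which is how the values in the original URL read; the helper name `to_epoch` and the `window_params` dictionary are hypothetical.

```python
from datetime import datetime, timezone

def to_epoch(year, month, day):
    """Seconds since 1970-01-01 UTC for midnight on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

# Assumed parameter layout, mirroring the period-based params shown earlier.
window_params = {
    'period1': str(to_epoch(2019, 1, 1)),
    'period2': str(to_epoch(2019, 12, 31)),
    'interval': '1d',
    'events': 'history',
}
print(window_params)
```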


So, how do we know which parameters this will and will not accept? The key is to take a look at the network activity on the web page. Go back to the website and click on the summary page. Open the developer tools, or right-click and then click "Inspect". At the top you'll see a tab called "Network". If any API calls are being made, they will show up here, specifically under the XHR filter. XHR (short for XMLHttpRequest) is a JavaScript object that is used to transfer data between your web browser and the web server.

If you click on some of these items, you'll notice that a new panel reveals the headers and also the response. What I typically do is look for a response that appears to be a JSON-formatted string. Then I go to the headers to see what calls I need to make. However, since we already know what the API call looks like, we can simply look for it in this list by checking the request URL. You can see that several API calls are being made to the v7 API for different elements on the page. If you look at the parameters, you'll notice that they are slightly different from the ones we originally used.

Now, this is a hidden API, so the only way you're really going to know how to interact with it is by exploring and experimenting. We know that it accepts a range and an interval, so one clue to finding out which values we can use is to simply look at the filters on the page. Based on what I found, the chart tab seemed to have the best information on what parameters I could use. Then it's a matter of plugging these in and seeing if they actually work.

In [113]:
params = {
    'range': '5y',
    'interval': '1w',
    'events': 'history'
}

In [115]:
response = requests.get(stock_url.format('F'), params=params)

file = StringIO(response.text)
reader = csv.reader(file)
data = list(reader)

for row in data[:5]:
    print(row)

['400 Bad Request: Invalid input - interval=1w is not supported. Valid intervals: [1m', ' 2m', ' 5m', ' 15m', ' 30m', ' 60m', ' 90m', ' 1h', ' 1d', ' 5d', ' 1wk', ' 1mo', ' 3mo]']


Fortunately, this API gives you feedback in the response body, which is a little unusual.
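Because the error message arrives as ordinary response text, it gets parsed as if it were a CSV row. A small guard can catch that before the data is used. This is a sketch only: `parse_history` is a hypothetical helper, and the expected header is taken from the successful output earlier in this notebook.

```python
import csv
from io import StringIO

EXPECTED_HEADER = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']

def parse_history(text):
    """Parse CSV text and raise if the first row is not the expected header."""
    rows = list(csv.reader(StringIO(text)))
    if not rows or rows[0] != EXPECTED_HEADER:
        raise ValueError('Unexpected response: ' + text[:200])
    return rows

# Samples reconstructed from the outputs shown in this notebook.
good = ('Date,Open,High,Low,Close,Adj Close,Volume\n'
        '2015-09-14,13.720000,13.790000,13.630000,13.780000,10.789936,26093500\n')
bad = '400 Bad Request: Invalid input - interval=1w is not supported.'

print(len(parse_history(good)))
try:
    parse_history(bad)
except ValueError as err:
    print(err)
```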

In [116]:
params = {
    'range': '5y',
    'interval': '1wk',
    'events': 'history'
}

In [117]:
response = requests.get(stock_url.format('F'), params=params)

file = StringIO(response.text)
reader = csv.reader(file)
data = list(reader)

for row in data[:5]:
    print(row)

['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
['2015-09-07', '13.750000', '13.810000', '13.530000', '13.710000', '10.735126', '22804500']
['2015-09-14', '13.720000', '14.880000', '13.630000', '14.280000', '11.181444', '192856400']
['2015-09-21', '14.200000', '14.430000', '13.270000', '13.530000', '10.594183', '164872900']
['2015-09-28', '13.460000', '14.010000', '13.010000', '13.990000', '10.954370', '189171300']
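Every value in these rows is still a string. If you want numbers, `csv.DictReader` makes it easy to pick columns by name and convert them. A minimal sketch, using sample text reconstructed from the first two data rows above (the choice of columns and the output key names are just for illustration):

```python
import csv
from io import StringIO

# Sample text reconstructed from the weekly output above.
sample = (
    'Date,Open,High,Low,Close,Adj Close,Volume\n'
    '2015-09-07,13.750000,13.810000,13.530000,13.710000,10.735126,22804500\n'
    '2015-09-14,13.720000,14.880000,13.630000,14.280000,11.181444,192856400\n'
)

records = []
for row in csv.DictReader(StringIO(sample)):
    records.append({
        'date': row['Date'],
        'close': float(row['Close']),   # prices become floats
        'volume': int(row['Volume']),   # volume becomes an integer
    })

print(records[0])
```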
