# Data Collection from Yahoo Finance

We start with [Yahoo Finance](https://finance.yahoo.com/), one of the best-known sources of public market data. We could have used other sources, since Yahoo Finance has been restricting access to its data since 2018. A Yahoo Finance premium account, which currently costs around $33 per month, would give us access to more complete historical data.

Data providers such as SimFin, Google Finance, and other API services might offer more flexibility for our data collection, as Yahoo currently only lets us look back 4 years or 5 quarters in a company's financial statements. But out of respect for what Yahoo Finance has offered over the years, we decided to anchor our first financial data project here.

To test our tools, we will use some fan-favorite stocks to demonstrate their functionality.

## The Imports

In [111]:
import lxml
from lxml import html
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import numpy as np
import pandas as pd
from time import sleep
import csv

## Data Collection: Function to Scrape Balance Sheet Information from Yahoo Finance

The first tool we are going to create is a function that collects the balance sheet information for any publicly traded company covered by Yahoo Finance. It is easy to use and takes only one argument: the stock ticker. It returns a clean DataFrame, as shown in the example below.
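
Before the full function, here is a minimal sketch of its parsing step, run against a tiny hand-written HTML fragment instead of a live page. The fragment, its `fi-row` class, and its values are hypothetical stand-ins; Yahoo's real markup is much larger and may change at any time.

```python
from lxml import html

# Hypothetical, heavily simplified stand-in for Yahoo's statement markup.
SAMPLE = """
<div>
  <div class="D(tbr) fi-row">
    <div><span>Breakdown</span></div>
    <div><span>9/29/2019</span></div>
    <div><span>9/29/2018</span></div>
  </div>
  <div class="D(tbr) fi-row">
    <div><span>Total Cash</span></div>
    <div><span>100,557,000</span></div>
    <div>-</div>
  </div>
</div>
"""

tree = html.fromstring(SAMPLE)

rows = []
# Each element whose class contains 'D(tbr)' is treated as one table row.
for table_row in tree.xpath("//div[contains(@class, 'D(tbr)')]"):
    cells = []
    # Each direct child <div> of a row is one cell.
    for cell in table_row.xpath("./div"):
        texts = cell.xpath(".//span/text()[1]")
        # A cell without exactly one <span> text is recorded as missing.
        cells.append(texts[0] if len(texts) == 1 else None)
    rows.append(cells)

print(rows)
```

The sketch records a missing cell as `None`, whereas the function below uses NaN and also discards rows with too many missing cells.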

In [2]:
# The following code was inspired by
# https://www.mattbutton.com/2019/01/24/how-to-scrape-yahoo-finance-and-extract-fundamental-stock-market-data-using-python-lxml-and-pandas/
# The first function gets the balance sheet for publicly traded companies

def get_balance_sheet(ticker):
    
    # Set base url
    bs_url = 'https://finance.yahoo.com/quote/' + ticker + '/balance-sheet?p=' + ticker
    
    # Set up the request headers that we're going to use, to simulate
    # a request by the Chrome browser. Simulating a request from a browser
    # is generally good practice when building a scraper
    headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Pragma': 'no-cache',
    'Referer': 'https://google.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
    
    # Fetch the page that we are going to parse, using the request headers
    # Parse the page with LXML, so that we can start doing some XPATH queries
    page = requests.get(bs_url, headers=headers)
    tree = html.fromstring(page.content)
    
    # Get the data and put it in a pandas DataFrame
    table_rows = tree.xpath("//div[contains(@class, 'D(tbr)')]")
    assert len(table_rows) > 0

    parsed_rows = []

    for table_row in table_rows:
        parsed_row = []
        el = table_row.xpath("./div")

        none_count = 0

        for rs in el:
            try:
                (text,) = rs.xpath('.//span/text()[1]')
                parsed_row.append(text)
            except ValueError:
                parsed_row.append(np.nan)
                none_count += 1

        if (none_count < 4):
            parsed_rows.append(parsed_row)

    df = pd.DataFrame(parsed_rows)
    
    # Do some initial cleaning of the DataFrame and turn this into a time series
    df = df.set_index(0).T
    df.rename(columns={'Breakdown' : 'Date'}, inplace=True)
    df['Date'] = pd.to_datetime(df['Date'])
    df.set_index('Date', inplace=True)
    df.sort_index(inplace=True)
    
    # Rename columns with the same names
    # e.g. deferred revenue is often reported twice, once under current and once under non-current liabilities
    cols = pd.Series(df.columns)
    for dup in df.columns[df.columns.duplicated()].unique():
        cols[df.columns.get_loc(dup)] = np.array([dup + ' ' + str(d_idx) if d_idx != 0
                                                  else dup
                                                  for d_idx in range(df.columns.get_loc(dup).sum())])
    df.columns=cols
    
    # Change the strings to floats
    for col in df.columns:
        df[col] = df[col].str.replace(',','')
        df[col] = df[col].astype(np.float64)
    
    # Add the ticker and company name to the DataFrame
    df['Ticker'] = ticker
    name = tree.xpath("//h1/text()")[0].strip()
    index = name.index(' - ')
    df['Company'] = name[index + 3:]
    
    # Finally, return the DataFrame
    return df

In [217]:
get_balance_sheet('AAPL')

| Date | Cash And Cash Equivalents | Short Term Investments | Total Cash | Net Receivables | Inventory | Other Current Assets | Total Current Assets | Gross property, plant and equipment | Accumulated Depreciation | Net property, plant and equipment | ... | Other long-term liabilities | Total non-current liabilities | Total Liabilities | Common Stock | Retained Earnings | Accumulated other comprehensive income | Total stockholders' equity | Total liabilities and stockholders' equity | Ticker | Company |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-09-29 | 20484000.0 | 46671000.0 | 67155000.0 | 15754000.0 | 2132000.0 | 8283000.0 | 106869000.0 | 61245000.0 | -34235000.0 | 27010000.0 | ... | 10055000.0 | 114431000.0 | 193437000.0 | 31251000.0 | 96364000.0 | 634000.0 | 128249000.0 | 321686000.0 | AAPL | Apple Inc. |
| 2017-09-29 | 20289000.0 | 53892000.0 | 74181000.0 | 17874000.0 | 4855000.0 | 13936000.0 | 128645000.0 | 75076000.0 | -41293000.0 | 33783000.0 | ... | 8911000.0 | 140458000.0 | 241272000.0 | 35867000.0 | 98330000.0 | -150000.0 | 134047000.0 | 375319000.0 | AAPL | Apple Inc. |
| 2018-09-29 | 25913000.0 | 40388000.0 | 66301000.0 | 23186000.0 | 3956000.0 | 12087000.0 | 131339000.0 | 90403000.0 | -49099000.0 | 41304000.0 | ... | 11165000.0 | 141712000.0 | 258578000.0 | 40201000.0 | 70400000.0 | -3454000.0 | 107147000.0 | 365725000.0 | AAPL | Apple Inc. |
| 2019-09-29 | 48844000.0 | 51713000.0 | 100557000.0 | 22926000.0 | 4106000.0 | 12352000.0 | 162819000.0 | 95957000.0 | -58579000.0 | 37378000.0 | ... | 20958000.0 | 142310000.0 | 248028000.0 | 45174000.0 | 45898000.0 | -584000.0 | 90488000.0 | 338516000.0 | AAPL | Apple Inc. |
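
One step inside `get_balance_sheet` that is worth trying in isolation is the renaming of duplicated column names. The sketch below reproduces the same naming scheme (first occurrence unchanged, later ones suffixed with ` 1`, ` 2`, and so on) using a plain counter. The helper `dedupe_names` and the sample column list are hypothetical and are not part of the original code.

```python
from collections import Counter


def dedupe_names(names):
    """Return the names with repeats suffixed ' 1', ' 2', ...

    The first occurrence of each name is left unchanged.
    """
    seen = Counter()
    result = []
    for name in names:
        count = seen[name]
        result.append(name if count == 0 else f"{name} {count}")
        seen[name] += 1
    return result


# Hypothetical column labels with one repeated name.
columns = ["Total Cash", "Deferred revenues", "Total Debt", "Deferred revenues"]
print(dedupe_names(columns))
```

Assigning the result back with `df.columns = dedupe_names(df.columns)` would have the same effect as the loop in the function. Note that this simple scheme does not guard against a label such as `Deferred revenues 1` already being present.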


## Data Collection: Function to Scrape Income Statement from Yahoo Finance

In a similar fashion, we need the data from the income statement as well. We create a tool that, again, requires only one argument to perform the task. A demonstration is shown below.
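
The cleaning steps the statement functions share (transpose, parse the dates, strip thousands separators) can be tried on a small hand-made input. In this sketch the rows and their values are toy placeholders, and the `ttm` column is an assumption about what the first data column of the page contains; it is filtered by label here rather than by position.

```python
import pandas as pd

# Toy rows in the shape produced by the parsing loop: one list per table row.
parsed_rows = [
    ["Breakdown", "ttm", "12/30/2018", "12/30/2017"],
    ["Total Revenue", "3,000", "2,000", "1,000"],
    ["Net Income", "300", "200", "100"],
]

# The first cell of each row becomes a column name after transposing.
df = pd.DataFrame(parsed_rows).set_index(0).T

# Remove the row that is not a real date (assumed to be labelled 'ttm').
df = df[df["Breakdown"] != "ttm"]

# Turn the 'Breakdown' column into a sorted datetime index.
df = df.rename(columns={"Breakdown": "Date"})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = df.set_index("Date").sort_index()

# Strip thousands separators and convert every column to float.
for col in df.columns:
    df[col] = df[col].str.replace(",", "").astype(float)

print(df)
```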

In [3]:
# The second function will get the income statement of publicly traded companies

def get_income_statement(ticker):
    
    # Set url for income statement
    is_url = 'https://finance.yahoo.com/quote/' + ticker + '/financials?p=' + ticker
    
    # Header to simulate a Chrome request
    headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Pragma': 'no-cache',
    'Referer': 'https://google.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
    
    # Get the page content
    page = requests.get(is_url, headers=headers)
    tree = html.fromstring(page.content)
    
    # Find the table we need and parse out the data
    table_rows = tree.xpath("//div[contains(@class, 'D(tbr)')]")
    assert len(table_rows) > 0
    parsed_rows = []

    for table_row in table_rows:
        parsed_row = []
        el = table_row.xpath('./div')

        none_count = 0
        for rs in el:
            try:
                (text,) = rs.xpath(".//span/text()[1]")
                parsed_row.append(text)
            except ValueError:
                parsed_row.append(np.nan)
                none_count += 1
        if none_count < 5:
            parsed_rows.append(parsed_row)
    
    # Create the DataFrame
    df = pd.DataFrame(parsed_rows)
    
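    # Do some cleaning
    df = df.set_index(0).T
    # Drop the first data column, which on this page appears to be the
    # trailing-twelve-months ('ttm') figure rather than a fiscal-year date
    df = df.drop(1)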
    df = df.rename(columns={'Breakdown' : 'Date'})
    df['Date'] = pd.to_datetime(df['Date'])
    df.set_index('Date', inplace=True)
    df.sort_index(inplace=True)
    
    # Rename columns with the same names
    # e.g. deferred revenue is often reported twice, once under current and once under non-current liabilities
    cols = pd.Series(df.columns)
    for dup in df.columns[df.columns.duplicated()].unique():
        cols[df.columns.get_loc(dup)] = np.array([dup + ' ' + str(d_idx) if d_idx != 0
                                                  else dup
                                                  for d_idx in range(df.columns.get_loc(dup).sum())])
    df.columns=cols
    
    # Change data types from string to float
    for col in df.columns:
        df[col] = df[col].str.replace(',','')
        df[col] = df[col].astype(np.float64)
        
    # Add the ticker and company name to the DataFrame
    df['Ticker'] = ticker
    name = tree.xpath("//h1/text()")[0].strip()
    index = name.index(' - ')
    df['Company'] = name[index + 3:]
    
    return df

In [219]:
get_income_statement('GOOG')

| Date | Total Revenue | Cost of Revenue | Gross Profit | Research Development | Selling General and Administrative | Total Operating Expenses | Operating Income or Loss | Interest Expense | Total Other Income/Expenses Net | Income Before Tax | Income Tax Expense | Income from Continuing Operations | Net Income | Net Income available to common shareholders | Basic | Diluted | EBITDA | Ticker | Company |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-12-30 | 74989000.0 | 28164000.0 | 46825000.0 | 12282000.0 | 15183000.0 | 27465000.0 | 19360000.0 | 104000.0 | -604000.0 | 19651000.0 | 3303000.0 | 16348000.0 | 16348000.0 | 15826000.0 | 684626.0 | 692930.0 | 24423000.0 | GOOG | Alphabet Inc. |
| 2016-12-30 | 90272000.0 | 35138000.0 | 55134000.0 | 13948000.0 | 17470000.0 | 31418000.0 | 23716000.0 | 124000.0 | -662000.0 | 24150000.0 | 4672000.0 | 19478000.0 | 19478000.0 | 19478000.0 | 687778.0 | 698706.0 | 29860000.0 | GOOG | Alphabet Inc. |
| 2017-12-30 | 110855000.0 | 45583000.0 | 65272000.0 | 16625000.0 | 19765000.0 | 36390000.0 | 28882000.0 | 109000.0 | -2736000.0 | 27193000.0 | 14531000.0 | 12662000.0 | 12662000.0 | 12662000.0 | 692901.0 | 703584.0 | 35797000.0 | GOOG | Alphabet Inc. |
| 2018-12-30 | 136819000.0 | 59549000.0 | 77270000.0 | 21419000.0 | 24459000.0 | 45878000.0 | 31392000.0 | 114000.0 | -5071000.0 | 34913000.0 | 4177000.0 | 30736000.0 | 30736000.0 | 30736000.0 | 695140.0 | 703285.0 | 40427000.0 | GOOG | Alphabet Inc. |


## Data Collection: Function to Scrape Cash Flow Statement from Yahoo Finance

As above, we build an easy-to-use tool for the cash flow statement that takes only the stock ticker as its argument.
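
The three statement functions differ mainly in the URL they request, so the URL construction could be shared. The sketch below is one possible way to do that; `build_statement_url`, `STATEMENT_PATHS`, and the keys `balance_sheet`, `income_statement`, and `cash_flow` are hypothetical names, while the path segments are taken from the URLs used in the functions themselves.

```python
BASE_URL = "https://finance.yahoo.com/quote/"

# Path segment of each statement page, as used in the three scraping functions.
STATEMENT_PATHS = {
    "balance_sheet": "balance-sheet",
    "income_statement": "financials",
    "cash_flow": "cash-flow",
}


def build_statement_url(ticker, statement):
    """Build the page URL for one ticker and one statement type."""
    try:
        path = STATEMENT_PATHS[statement]
    except KeyError:
        raise ValueError(f"Unknown statement type: {statement!r}") from None
    return f"{BASE_URL}{ticker}/{path}?p={ticker}"


print(build_statement_url("IBM", "cash_flow"))
```

With a helper like this, the rest of each function (request, parsing, cleaning) could in principle live in one shared routine, which would avoid keeping three near-identical copies in sync.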

In [4]:
def get_cash_flow(ticker):
    
    # Set url for cash flow statement
    cf_url = 'https://finance.yahoo.com/quote/' + ticker + '/cash-flow?p=' + ticker
    
    # Header to simulate a Chrome request
    headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Pragma': 'no-cache',
    'Referer': 'https://google.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
    
    # Get the page content
    page = requests.get(cf_url, headers=headers)
    tree = html.fromstring(page.content)
    
    # Find the table we need and parse out the data
    table_rows = tree.xpath("//div[contains(@class, 'D(tbr)')]")
    assert len(table_rows) > 0
    parsed_rows = []

    for table_row in table_rows:
        parsed_row = []
        el = table_row.xpath('./div')

        none_count = 0
        for rs in el:
            try:
                (text,) = rs.xpath(".//span/text()[1]")
                parsed_row.append(text)
            except ValueError:
                parsed_row.append(np.nan)
                none_count += 1
        if none_count < 5:
            parsed_rows.append(parsed_row)
    
    # Create the DataFrame
    df = pd.DataFrame(parsed_rows)
    
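    # Do some cleaning
    df = df.set_index(0).T
    # Drop the first data column, which on this page appears to be the
    # trailing-twelve-months ('ttm') figure rather than a fiscal-year date
    df = df.drop(1)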
    df = df.rename(columns={'Breakdown' : 'Date'})
    df['Date'] = pd.to_datetime(df['Date'])
    df.set_index('Date', inplace=True)
    df.sort_index(inplace=True)
    
    # Rename columns with the same names
    # e.g. deferred revenue is often reported twice, once under current and once under non-current liabilities
    cols = pd.Series(df.columns)
    for dup in df.columns[df.columns.duplicated()].unique():
        cols[df.columns.get_loc(dup)] = np.array([dup + ' ' + str(d_idx) if d_idx != 0
                                                  else dup
                                                  for d_idx in range(df.columns.get_loc(dup).sum())])
    df.columns=cols
    
    # Change data types from string to float
    for col in df.columns:
        df[col] = df[col].str.replace(',','')
        df[col] = df[col].astype(np.float64)
        
    # Add the ticker and company name to the DataFrame
    df['Ticker'] = ticker
    name = tree.xpath("//h1/text()")[0].strip()
    index = name.index(' - ')
    df['Company'] = name[index + 3:]
    
    return df

In [220]:
get_cash_flow('IBM')

| Date | Net Income | Depreciation & amortization | Deferred income taxes | Stock based compensation | Change in working capital | Accounts receivable | Inventory | Accounts Payable | Other working capital | Other non-cash items | ... | Other financing activites | Net cash used privided by (used for) financing activities | Net change in cash | Cash at beginning of period | Cash at end of period | Operating Cash Flow | Capital Expenditure | Free Cash Flow | Ticker | Company |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-12-30 | 13190000.0 | 3855000.0 | 1387000.0 | 468000.0 | -2444000.0 | 200000.0 | 133000.0 | 81000.0 | 12857000.0 | -200000.0 | ... | 321000.0 | -9165000.0 | -790000.0 | 8476000.0 | 7686000.0 | 17008000.0 | -4151000.0 | 12857000.0 | IBM | International Business Machines Corporation |
| 2016-12-30 | 11872000.0 | 4381000.0 | -1132000.0 | 544000.0 | 1231000.0 | 1218000.0 | -14000.0 | 197000.0 | 12808000.0 | 544000.0 | ... | 204000.0 | -5791000.0 | 140000.0 | 7686000.0 | 7826000.0 | 16958000.0 | -4150000.0 | 12808000.0 | IBM | International Business Machines Corporation |
| 2017-12-30 | 5753000.0 | 4541000.0 | -931000.0 | 534000.0 | 6813000.0 | 419000.0 | 18000.0 | 47000.0 | 12951000.0 | 1000.0 | ... | 174000.0 | -6417000.0 | 4146000.0 | 7826000.0 | 11972000.0 | 16724000.0 | -3773000.0 | 12951000.0 | IBM | International Business Machines Corporation |
| 2018-12-30 | 8728000.0 | 4480000.0 | 853000.0 | 510000.0 | 554000.0 | -345000.0 | -127000.0 | 126000.0 | 11283000.0 | -1000.0 | ... | 112000.0 | -10470000.0 | -630000.0 | 12234000.0 | 11604000.0 | 15247000.0 | -3964000.0 | 11283000.0 | IBM | International Business Machines Corporation |


## Data Collection: Get All Three Statements in a Combined Dataset 

Because we will be collecting data at scale, we need a function that collects all three financial statements and combines them into one clean, meaningful DataFrame.
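
The reason for prefixing column names before combining is that some labels, such as `Inventory`, appear on more than one statement. The sketch below shows the idea on two toy frames; the values are placeholders and only the prefixes `(BS) ` and `(CF) ` follow the function in this section.

```python
import pandas as pd

# Two toy statements that share a date index and a column name.
dates = pd.to_datetime(["2017-12-30", "2018-12-30"])
df_bs = pd.DataFrame({"Inventory": [1.0, 2.0]}, index=dates)
df_cf = pd.DataFrame({"Inventory": [-0.5, 0.3]}, index=dates)

# Prefix each statement so the shared label stays distinguishable,
# then place the frames side by side, aligned on the date index.
combined = pd.concat(
    [df_bs.add_prefix("(BS) "), df_cf.add_prefix("(CF) ")],
    axis=1,
)

print(combined.columns.tolist())
```

Because `pd.concat` with `axis=1` performs an outer join on the index by default, a date that is present in one statement but not in another would show up as a row with NaN in the missing columns.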

In [5]:
def get_financials(ticker):
    
    # Get all three statements and add a prefix
    df_bs = get_balance_sheet(ticker)
    df_is = get_income_statement(ticker)
    df_cf = get_cash_flow(ticker)
    
    # Get the Ticker and Company out as a single dataframe
    df_company = df_bs[['Ticker', 'Company']]
    
    # Drop the Ticker and Company columns and add a prefix on each DataFrame
    df_bs = df_bs.drop(columns=['Ticker', 'Company']).add_prefix('(BS) ')
    df_is = df_is.drop(columns=['Ticker', 'Company']).add_prefix('(IS) ')
    df_cf = df_cf.drop(columns=['Ticker', 'Company']).add_prefix('(CF) ')
    
    # Concatenate everything into one big DataFrame
    big_df = pd.concat(objs=[df_company, df_bs, df_is, df_cf], axis=1)
    
    # Return the combined DataFrame
    return big_df

In [221]:
get_financials('BA')

| Date | Ticker | Company | (BS) Cash And Cash Equivalents | (BS) Short Term Investments | (BS) Total Cash | (BS) Net Receivables | (BS) Inventory | (BS) Other Current Assets | (BS) Total Current Assets | (BS) Gross property, plant and equipment | ... | (CF) Common stock repurchased | (CF) Dividends Paid | (CF) Other financing activites | (CF) Net cash used privided by (used for) financing activities | (CF) Net change in cash | (CF) Cash at beginning of period | (CF) Cash at end of period | (CF) Operating Cash Flow | (CF) Capital Expenditure | (CF) Free Cash Flow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-12-30 | BA | The Boeing Company | 11302000.0 | 750000.0 | 12052000.0 | 8003000.0 | 47257000.0 | 212000.0 | 68234000.0 | 28362000.0 | ... | -6751000.0 | -2490000.0 | 61000.0 | -7920000.0 | -431000.0 | 11733000.0 | 11302000.0 | 9363000.0 | -2450000.0 | 6913000.0 |
| 2016-12-30 | BA | The Boeing Company | 8801000.0 | 1228000.0 | 10029000.0 | 7804000.0 | 43199000.0 | 428000.0 | 62488000.0 | 29690000.0 | ... | -7001000.0 | -2756000.0 | -117000.0 | -9587000.0 | -2501000.0 | 11302000.0 | 8801000.0 | 10499000.0 | -2613000.0 | 7886000.0 |
| 2017-12-30 | BA | The Boeing Company | 8813000.0 | 1179000.0 | 9992000.0 | 9763000.0 | 44344000.0 | 309000.0 | 65161000.0 | 30313000.0 | ... | -9236000.0 | -3417000.0 | -132000.0 | -11350000.0 | 12000.0 | 8801000.0 | 8813000.0 | 13344000.0 | -1870000.0 | 11474000.0 |
| 2018-12-30 | BA | The Boeing Company | 7637000.0 | 927000.0 | 8564000.0 | 3936000.0 | 62567000.0 | 2335000.0 | 87830000.0 | 31213000.0 | ... | -9000000.0 | -3946000.0 | -222000.0 | -11722000.0 | -1074000.0 | 8887000.0 | 7637000.0 | 15322000.0 | -1791000.0 | 13531000.0 |


## Data Collection: Read in the S&P 500 Sector Data

The goal here is to collect data from all S&P 500 companies. To make this happen, we will need:

- A list of the 500 companies tracked in the index
- Their stock tickers
- Sector and industry information

Note that the constituents of the S&P 500 change over time with market conditions, because the index aims to track the companies most representative of overall market performance. The S&P 500 also groups its companies into 11 sectors, with several industries under each sector.

The list is updated regularly on the official sites and on Wikipedia. Because of time constraints, however, we elected to use the clean, formatted list provided by [DataHub](https://datahub.io/core/s-and-p-500-companies-financials#data-cli). Please note that this list was last updated in 2018, so there could be many changes since then. It also doesn't contain information on the industries under each sector. Both points should be addressed in the next phase of the project.

In [9]:
# Download the S&P 500 data from data hub and read it in.
spx = pd.read_csv('./data/constituents_csv.csv')

In [10]:
spx.head()

|  | Symbol | Name | Sector |
|---|---|---|---|
| 0 | MMM | 3M Company | Industrials |
| 1 | AOS | A.O. Smith Corp | Industrials |
| 2 | ABT | Abbott Laboratories | Health Care |
| 3 | ABBV | AbbVie Inc. | Health Care |
| 4 | ACN | Accenture plc | Information Technology |


## Data Collection: Scraping Current Company Statistics from Yahoo Finance

In addition to financial statements, we are also interested in market statistics of the companies in scope. Yahoo Finance's free account only provides access to one year of such statistics. 

To access more data, consider using the SimFin API, Yahoo Finance's premium account or other services.

In [153]:
def get_stats(ticker):
    
    # Get the url and scrape the content from the site.
    url = 'https://finance.yahoo.com/quote/' + ticker + '/key-statistics?p=' + ticker
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Cache-Control': 'max-age=0',
        'Pragma': 'no-cache',
        'Referer': 'https://google.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
    page = requests.get(url, headers=headers)
    tree = BeautifulSoup(page.content, 'lxml')
    
    # Find all the tables using Beautiful Soup, a different method from what we used before
    # But the gist is the same
    tables = tree.find_all('table', {'class' : 'W(100%)'})
    
    # Parse all the data out and put them in a pandas DataFrame
    rows = []
    for table in tables:
        for row in table.find_all('tr'):
            parsed_row = []
            el = row.find_all('td')
            for rs in el:
                parsed_row.append(rs.text)
            rows.append(parsed_row)
    df = pd.DataFrame(rows)
    
    # Get the strange number suffix out and return a clean stat name column
    for index, item in enumerate(list(df[0])):
        try:
            int(item[-1])
            df.loc[index, 0] = item[:-1].strip()
        except:
            df.loc[index, 0] = item.strip()
    
    # Clean the values and convert them to their proper types
    for index, item in enumerate(list(df[1])):
        # Strip surrounding whitespace (str.strip returns a new string)
        item = item.strip()
        # Replace any comma in the any of the cells. This gives us cleaner data to work with
        if ',' in item:
            item = item.replace(',','')
        # Clean all numeric data, NaNs, factors and dates and put them in their correct format
        if item[-1] in ('T', 'B', 'M', 'k', '%'):
            if item[-1] == 'T':
                df.loc[index,1] = pd.to_numeric(item[:-1]) * 1_000_000_000_000
            elif item[-1] == 'B':
                df.loc[index,1] = pd.to_numeric(item[:-1]) * 1_000_000_000
            elif item[-1] == 'M':
                df.loc[index,1] = pd.to_numeric(item[:-1]) * 1_000_000
            elif item[-1] == 'k':
                df.loc[index,1] = pd.to_numeric(item[:-1]) * 1_000
            elif item[-1] == '%':
                try:
                    df.loc[index,1] = pd.to_numeric(item[:-1]) * 0.01
                except:
                    df.loc[index,1] = np.nan
            else:
                print('Something is terribly wrong..')
        elif item == 'N/A':
            df.loc[index,1] = np.nan
        elif '/' in item:
            df.loc[index,1] = item
        else:
            try:
                float(item)
                df.loc[index,1] = pd.to_numeric(item)
            except:
                pd.to_datetime(item)
                df.loc[index,1] = pd.to_datetime(item)
        
    # Reformat the table and transpose it so we can stack stats for different companies together
    df = df.set_index(0).T
    
    # Change the columns to correct data types
    for col in df.columns:
        try:
            pd.to_numeric(df[col])
            df[col] = pd.to_numeric(df[col])
        except:
            continue
    
    # Finally, add the Ticker back in as an indicator.
    df.insert(0, 'Ticker', ticker)
    
    # Return the DataFrame
    return df

In [222]:
get_stats('SBUX')

|  | Ticker | Market Cap (intraday) | Enterprise Value | Trailing P/E | Forward P/E | PEG Ratio (5 yr expected) | Price/Sales (ttm) | Price/Book (mrq) | Enterprise Value/Revenue | Enterprise Value/EBITDA | ... | Diluted EPS (ttm) | Quarterly Earnings Growth (yoy) | Total Cash (mrq) | Total Cash Per Share (mrq) | Total Debt (mrq) | Total Debt/Equity (mrq) | Current Ratio (mrq) | Book Value Per Share (mrq) | Operating Cash Flow (ttm) | Levered Free Cash Flow (ttm) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SBUX | 100860000000.0 | 108360000000.0 | 29.25 | 24.9 | 2.64 | 3.8 | NaN | 4.09 | 20.2 | ... | 2.92 | 0.063 | 2760000000.0 | 2.34 | 11240000000.0 | NaN | 0.92 | -5.26 | 5050000000.0 | 3680000000.0 |
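
The suffix handling inside `get_stats` (`T`, `B`, `M`, `k`, and `%`) can be isolated into a small pure-Python helper, which makes it easier to test without touching the network. This is only a sketch: `parse_stat` and `MULTIPLIERS` are hypothetical names, and the sample strings are assumed raw forms of values like those in the output above.

```python
import math

# Multiplier applied for each trailing suffix handled in get_stats.
MULTIPLIERS = {"T": 1e12, "B": 1e9, "M": 1e6, "k": 1e3, "%": 0.01}


def parse_stat(text):
    """Convert strings such as '100.86B', '6.30%' or '1,234' to a float.

    'N/A' and empty strings become NaN. Anything else that is not numeric
    (for example a date) raises ValueError and is left to the caller.
    """
    text = text.strip().replace(",", "")
    if not text or text == "N/A":
        return math.nan
    suffix = text[-1]
    if suffix in MULTIPLIERS:
        return float(text[:-1]) * MULTIPLIERS[suffix]
    return float(text)


for raw in ["100.86B", "6.30%", "1,234", "29.25", "N/A"]:
    print(raw, "->", parse_stat(raw))
```

Unlike the function above, this helper does not try to parse dates or ratio strings containing `/`; those cases would still need their own branches.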


## Data Collection: Collect Financial Data At Scale (By Sector)

### Get the List of Companies for Each Sector

In [13]:
# Look at the sector distributions.
spx['Sector'].value_counts()

Consumer Discretionary        84
Information Technology        70
Financials                    68
Industrials                   67
Health Care                   61
Consumer Staples              34
Real Estate                   33
Energy                        32
Utilities                     28
Materials                     25
Telecommunication Services     3
Name: Sector, dtype: int64

In [62]:
# Create a list of companies in each sector.
cons_staple_list = list(spx.loc[spx['Sector'] == 'Consumer Staples']['Symbol'])
cons_disc_list = list(spx.loc[spx['Sector'] == 'Consumer Discretionary']['Symbol'])
info_tect_list = list(spx.loc[spx['Sector'] == 'Information Technology']['Symbol'])
fin_list = list(spx.loc[spx['Sector'] == 'Financials']['Symbol'])
indus_list = list(spx.loc[spx['Sector'] == 'Industrials']['Symbol'])
heath_care_list = list(spx.loc[spx['Sector'] == 'Health Care']['Symbol'])
real_estate_list = list(spx.loc[spx['Sector'] == 'Real Estate']['Symbol'])
energy_list = list(spx.loc[spx['Sector'] == 'Energy']['Symbol'])
utl_list = list(spx.loc[spx['Sector'] == 'Utilities']['Symbol'])
materials_list = list(spx.loc[spx['Sector'] == 'Materials']['Symbol'])
telecom_list = list(spx.loc[spx['Sector'] == 'Telecommunication Services']['Symbol'])
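
As an alternative to eleven separate variables, the same lists could be held in one dictionary keyed by sector name. The sketch below uses only the five rows shown in `spx.head()` above; `spx_sample` and `sector_tickers` are hypothetical names.

```python
import pandas as pd

# The first five rows of the constituents file, as displayed by spx.head().
spx_sample = pd.DataFrame({
    "Symbol": ["MMM", "AOS", "ABT", "ABBV", "ACN"],
    "Name": ["3M Company", "A.O. Smith Corp", "Abbott Laboratories",
             "AbbVie Inc.", "Accenture plc"],
    "Sector": ["Industrials", "Industrials", "Health Care",
               "Health Care", "Information Technology"],
})

# Group the symbols by sector and collect each group into a list.
sector_tickers = spx_sample.groupby("Sector")["Symbol"].apply(list).to_dict()

print(sector_tickers)
```

A dictionary like this can be looped over directly, which may be convenient when the same scraping routine has to run for every sector.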

### Create a Function that Grabs the Financial Statements in Bulk and Store Them Into a Dictionary

Now comes the fun part: we want to collect our data at scale, but we have to do so responsibly, without hitting Yahoo's servers too frequently.

Some companies will not come through the first time around, possibly because of mechanisms Yahoo uses to limit this kind of scraping.

Therefore, on every pass through the loop, we record the companies that failed and retry them in another pass, hoping to get them the second time around. We use a total of three passes to capture as many companies as possible.

In addition to these passes, we add a random sleep time between server requests. This should help us collect data more responsibly.

Overall, we were able to collect 474 companies out of 500. I call this a success.
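
The retry scheme described above can also be expressed as a single loop over passes. The sketch below is a simplified variant, not the function used in this notebook: `scrape_with_retries`, its parameters, and `fake_fetch` are hypothetical, it uses the same pause choices on every pass, and it takes the fetch and sleep functions as arguments so that it can be tried without network access or real waiting.

```python
import random
import time


def scrape_with_retries(tickers, fetch, passes=3,
                        pause_choices=(5, 7, 8, 12, 18),
                        cooldown=60, sleep=time.sleep):
    """Fetch every ticker, retrying the failures in later passes.

    Returns (results, failed): a dict of successful results and the list
    of tickers that still failed after the last pass.
    """
    results = {}
    pending = list(tickers)
    for attempt in range(passes):
        if not pending:
            break
        if attempt > 0:
            sleep(cooldown)  # longer break before each retry pass
        failed = []
        for ticker in pending:
            try:
                results[ticker] = fetch(ticker)
            except Exception as exc:
                print("Failed:", ticker, type(exc))
                failed.append(ticker)
            sleep(random.choice(pause_choices))  # random pause between requests
        pending = failed
    return results, pending


# Stand-in for get_financials: 'BAD' always fails, 'FLAKY' fails once.
calls = {}

def fake_fetch(ticker):
    calls[ticker] = calls.get(ticker, 0) + 1
    if ticker == "BAD" or (ticker == "FLAKY" and calls[ticker] == 1):
        raise AssertionError(ticker)
    return ticker.lower()


# A no-op sleep keeps the demonstration instant.
results, failed = scrape_with_retries(
    ["OK", "FLAKY", "BAD"], fake_fetch, sleep=lambda seconds: None
)
print(results, failed)
```

In real use, `fetch` would be `get_financials` and the default `time.sleep` would be kept so that requests stay spaced out.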

In [75]:
def get_bulk_financials(company_list):

    # Initiate a bunch of objects that we will need to use later on.
    df_dict = {}
    error_items = []
    error_items_second_try = []
    error_items_third_try = []
    success_list = []
    
    # First scrape. This should give us the majority of the companies we want.
    for company in company_list:
        try:
            df_dict[company] = get_financials(company)
            success_list.append(company)
        except Exception as e:
            print('Failed:', company, type(e))
            error_items.append(company)
        sleep(np.random.choice([5,7,8,12,18]))
    
    if len(error_items) > 0:
        print('------------------------------------------')
        print('Now we are trying a second round for the error items due to potential server-side detections.')
        sleep(60) 
        # Second scrape after a 60 second break. 
        # This should give us a few that we weren't able to scrape the first time around.
        for company in error_items:
            try:
                df_dict[company] = get_financials(company)
                success_list.append(company)
            except Exception as e:
                error_items_second_try.append(company)
                print('Failed:', company, type(e))
            sleep(np.random.choice([4,9,10,11,20]))
        
        if len(error_items_second_try) > 0:
            print('------------------------------------------')
            print('Now we are trying a third round for the error items due to potential server-side detections.')
            sleep(60)    
            # Third scrape after another 60 second break.
            # Hopefully, we will get everything we need.
            for company in error_items_second_try:
                try:
                    df_dict[company] = get_financials(company)
                    success_list.append(company)
                except Exception as e:
                    error_items_third_try.append(company)
                    print('Failed:', company, type(e))
                sleep(np.random.choice([4,9,10,11,20]))
    
    # Give us some final summary.
    if len(error_items_third_try) > 0:
        print('Successful srape for the following companies:', success_list)
        print('Failed to scrape information for the following companies:', error_items_third_try)
        print('Please try again later...')
    else:
        print('All companies scraped!')
    
    # Return the success_list and the dictionary for our DataFrames.
    return success_list, df_dict

### Start to Scrape Each of the 11 Sectors Separately

In [69]:
telecom_list, telecom_df_dict = get_bulk_financials(telecom_list)

All companies scraped!


In [76]:
materials_list, materials_df_dict = get_bulk_financials(materials_list)

Failed: DWDP <class 'AssertionError'>
Failed: MON <class 'AssertionError'>
Failed: PX <class 'AssertionError'>
------------------------------------------
Now we are trying a second round for the error items due to potential server-side detections.
Failed: DWDP <class 'AssertionError'>
Failed: MON <class 'AssertionError'>
Failed: PX <class 'AssertionError'>
------------------------------------------
Now we are trying a third round for the error items due to potential server-side detections.
Failed: DWDP <class 'AssertionError'>
Failed: MON <class 'AssertionError'>
Failed: PX <class 'AssertionError'>
Successful srape for the following companies: ['APD', 'ALB', 'AVY', 'BLL', 'CF', 'EMN', 'ECL', 'FMC', 'FCX', 'IP', 'IFF', 'LYB', 'MLM', 'NEM', 'NUE', 'PKG', 'PPG', 'SEE', 'SHW', 'MOS', 'VMC', 'WRK']
Failed to scrape information for the following companies: ['DWDP', 'MON', 'PX']
Please try again later...


In [77]:
utl_list, utl_df_dict = get_bulk_financials(utl_list)

Failed: AWK <class 'AssertionError'>
Failed: NEE <class 'AssertionError'>
Failed: SCG <class 'AssertionError'>
------------------------------------------
Now we are trying a second round for the error items due to potential server-side detections.
Failed: SCG <class 'AssertionError'>
------------------------------------------
Now we are trying a third round for the error items due to potential server-side detections.
Failed: SCG <class 'AssertionError'>
Successful srape for the following companies: ['AES', 'LNT', 'AEE', 'AEP', 'CNP', 'CMS', 'ED', 'D', 'DTE', 'DUK', 'EIX', 'ETR', 'ES', 'EXC', 'FE', 'NI', 'NRG', 'PCG', 'PNW', 'PPL', 'PEG', 'SRE', 'SO', 'WEC', 'XEL', 'AWK', 'NEE']
Failed to scrape information for the following companies: ['SCG']
Please try again later...


In [78]:
energy_list, energy_df_dict = get_bulk_financials(energy_list)

Failed: APC <class 'AssertionError'>
Failed: ANDV <class 'AssertionError'>
Failed: BHGE <class 'AssertionError'>
Failed: NFX <class 'AssertionError'>
------------------------------------------
Now we are trying a second round for the error items due to potential server-side detections.
Failed: APC <class 'AssertionError'>
Failed: ANDV <class 'AssertionError'>
Failed: NFX <class 'AssertionError'>
------------------------------------------
Now we are trying a third round for the error items due to potential server-side detections.
Failed: APC <class 'AssertionError'>
Failed: ANDV <class 'AssertionError'>
Failed: NFX <class 'AssertionError'>
Successful srape for the following companies: ['APA', 'COG', 'CHK', 'CVX', 'XEC', 'CXO', 'COP', 'DVN', 'EOG', 'EQT', 'XOM', 'HAL', 'HP', 'HES', 'KMI', 'MRO', 'MPC', 'NOV', 'NBL', 'OXY', 'OKE', 'PSX', 'PXD', 'RRC', 'SLB', 'FTI', 'VLO', 'WMB', 'BHGE']
Failed to scrape information for the following companies: ['APC', 'ANDV', 'NFX']
Please try again later

In [79]:
real_estate_list, real_estate_df_dict = get_bulk_financials(real_estate_list)

Failed: AVB <class 'AssertionError'>
Failed: CBG <class 'AssertionError'>
Failed: FRT <class 'AssertionError'>
Failed: GGP <class 'AssertionError'>
Failed: HCN <class 'AssertionError'>
------------------------------------------
Now we are trying a second round for the error items due to potential server-side detections.
Failed: CBG <class 'AssertionError'>
Failed: GGP <class 'AssertionError'>
Failed: HCN <class 'AssertionError'>
------------------------------------------
Now we are trying a third round for the error items due to potential server-side detections.
Failed: CBG <class 'AssertionError'>
Failed: GGP <class 'AssertionError'>
Failed: HCN <class 'AssertionError'>
Successful srape for the following companies: ['ARE', 'AMT', 'AIV', 'BXP', 'CCI', 'DLR', 'DRE', 'EQIX', 'EQR', 'ESS', 'EXR', 'HCP', 'HST', 'IRM', 'KIM', 'MAC', 'MAA', 'PLD', 'PSA', 'O', 'REG', 'SBAC', 'SPG', 'SLG', 'UDR', 'VTR', 'VNO', 'WY', 'AVB', 'FRT']
Failed to scrape information for the following companies: ['CBG'

In [80]:
heath_care_list, heath_care_df_dict = get_bulk_financials(heath_care_list)

Failed: AET <class 'AssertionError'>
Failed: A <class 'AssertionError'>
Failed: ANTM <class 'AssertionError'>
Failed: BAX <class 'AssertionError'>
Failed: XRAY <class 'AssertionError'>
Failed: EVHC <class 'AssertionError'>
Failed: ESRX <class 'AssertionError'>
Failed: SYK <class 'AssertionError'>
Failed: ZBH <class 'AssertionError'>
------------------------------------------
Now we are trying a second round for the error items due to potential server-side detections.
Failed: AET <class 'AssertionError'>
Failed: A <class 'AssertionError'>
Failed: EVHC <class 'AssertionError'>
Failed: ESRX <class 'AssertionError'>
------------------------------------------
Now we are trying a third round for the error items due to potential server-side detections.
Failed: AET <class 'AssertionError'>
Failed: EVHC <class 'AssertionError'>
Failed: ESRX <class 'AssertionError'>
Successful srape for the following companies: ['ABT', 'ABBV', 'ALXN', 'ALGN', 'AGN', 'ABC', 'AMGN', 'BDX', 'BIIB', 'BSX', 'BMY', 'C

In [81]:
indus_list, indus_df_dict = get_bulk_financials(indus_list)

Failed: ETN <class 'AssertionError'>
Failed: LLL <class 'AssertionError'>
Failed: COL <class 'AssertionError'>
------------------------------------------
Now we are trying a second round for the error items due to potential server-side detections.
Failed: LLL <class 'AssertionError'>
Failed: COL <class 'AssertionError'>
------------------------------------------
Now we are trying a third round for the error items due to potential server-side detections.
Failed: LLL <class 'AssertionError'>
Failed: COL <class 'AssertionError'>
Successful srape for the following companies: ['MMM', 'AOS', 'AYI', 'ALK', 'ALLE', 'AAL', 'AME', 'ARNC', 'BA', 'CHRW', 'CAT', 'CTAS', 'CSX', 'CMI', 'DE', 'DAL', 'DOV', 'EMR', 'EFX', 'EXPD', 'FAST', 'FDX', 'FLS', 'FLR', 'FTV', 'FBHS', 'GD', 'GE', 'GWW', 'HON', 'HII', 'INFO', 'ITW', 'IR', 'JBHT', 'JEC', 'JCI', 'KSU', 'LMT', 'MAS', 'NLSN', 'NSC', 'NOC', 'PCAR', 'PH', 'PNR', 'PWR', 'RTN', 'RSG', 'RHI', 'ROK', 'ROP', 'LUV', 'SRCL', 'TXT', 'TDG', 'UNP', 'UAL', 'UPS', 'U

In [82]:
fin_list, fin_df_dict = get_bulk_financials(fin_list)

Failed: BRK.B <class 'AttributeError'>
Failed: LUK <class 'AssertionError'>
Failed: RF <class 'AssertionError'>
Failed: TMK <class 'AssertionError'>
Failed: XL <class 'AssertionError'>
------------------------------------------
Now we are trying a second round for the error items due to potential server-side detections.
Failed: BRK.B <class 'AttributeError'>
Failed: LUK <class 'AssertionError'>
Failed: TMK <class 'AssertionError'>
Failed: XL <class 'AssertionError'>
------------------------------------------
Now we are trying a third round for the error items due to potential server-side detections.
Failed: BRK.B <class 'AttributeError'>
Failed: LUK <class 'AssertionError'>
Failed: TMK <class 'AssertionError'>
Failed: XL <class 'AssertionError'>
Successful srape for the following companies: ['AMG', 'AFL', 'ALL', 'AXP', 'AIG', 'AMP', 'AON', 'AJG', 'AIZ', 'BAC', 'BBT', 'BLK', 'HRB', 'BHF', 'COF', 'CBOE', 'SCHW', 'CB', 'CINF', 'C', 'CFG', 'CME', 'CMA', 'DFS', 'ETFC', 'RE', 'FITB', 'BEN', 

In [83]:
info_tect_list, info_tect_df_dict = get_bulk_financials(info_tect_list)

Failed: GOOGL <class 'AssertionError'>
Failed: APH <class 'AssertionError'>
Failed: CA <class 'AssertionError'>
Failed: CSRA <class 'AssertionError'>
Failed: HRS <class 'AssertionError'>
Failed: RHT <class 'AssertionError'>
Failed: TSS <class 'AssertionError'>
------------------------------------------
Now we are trying a second round for the error items due to potential server-side detections.
Failed: CA <class 'AssertionError'>
Failed: CSRA <class 'AssertionError'>
Failed: HRS <class 'AssertionError'>
Failed: RHT <class 'AssertionError'>
Failed: TSS <class 'AssertionError'>
------------------------------------------
Now we are trying a third round for the error items due to potential server-side detections.
Failed: CA <class 'AssertionError'>
Failed: CSRA <class 'AssertionError'>
Failed: HRS <class 'AssertionError'>
Failed: RHT <class 'AssertionError'>
Failed: TSS <class 'AssertionError'>
Successful srape for the following companies: ['ACN', 'ATVI', 'ADBE', 'AMD', 'AKAM', 'ADS', 'GOO

In [84]:
cons_disc_list, cons_disc_df_dict = get_bulk_financials(cons_disc_list)

Failed: DISCK <class 'AssertionError'>
Failed: KORS <class 'AssertionError'>
Failed: NWS <class 'AssertionError'>
Failed: PCLN <class 'AssertionError'>
Failed: SNI <class 'AssertionError'>
Failed: TWX <class 'AssertionError'>
Failed: WYN <class 'AssertionError'>
------------------------------------------
Now we are trying a second round for the error items due to potential server-side detections.
Failed: KORS <class 'AssertionError'>
Failed: PCLN <class 'AssertionError'>
Failed: SNI <class 'AssertionError'>
Failed: TWX <class 'AssertionError'>
Failed: WYN <class 'AssertionError'>
------------------------------------------
Now we are trying a third round for the error items due to potential server-side detections.
Failed: KORS <class 'AssertionError'>
Failed: PCLN <class 'AssertionError'>
Failed: SNI <class 'AssertionError'>
Failed: TWX <class 'AssertionError'>
Failed: WYN <class 'AssertionError'>
Successful srape for the following companies: ['AAP', 'AMZN', 'APTV', 'AZO', 'BBY', 'BWA',

In [85]:
cons_staple_list, cons_staple_df_dict = get_bulk_financials(cons_staple_list)

Failed: MO <class 'AssertionError'>
Failed: BF.B <class 'AssertionError'>
Failed: DPS <class 'AssertionError'>
------------------------------------------
Now we are trying a second round for the error items due to potential server-side detections.
Failed: BF.B <class 'AssertionError'>
Failed: DPS <class 'AssertionError'>
------------------------------------------
Now we are trying a third round for the error items due to potential server-side detections.
Failed: BF.B <class 'AssertionError'>
Failed: DPS <class 'AssertionError'>
Successful srape for the following companies: ['ADM', 'CPB', 'CHD', 'KO', 'CL', 'CAG', 'STZ', 'COST', 'COTY', 'CVS', 'EL', 'GIS', 'HRL', 'SJM', 'K', 'KMB', 'KHC', 'KR', 'MKC', 'TAP', 'MDLZ', 'MNST', 'PEP', 'PM', 'PG', 'SYY', 'CLX', 'HSY', 'TSN', 'WMT', 'WBA', 'MO']
Failed to scrape information for the following companies: ['BF.B', 'DPS']
Please try again later...


### Combine all Sector Dictionaries into One Combined Dictionary

In [96]:
spx_df_dict = {**cons_staple_df_dict,
               **cons_disc_df_dict,
               **info_tect_df_dict,
               **fin_df_dict,
               **indus_df_dict,
               **heath_care_df_dict,
               **real_estate_df_dict,
               **energy_df_dict,
               **utl_df_dict,
               **materials_df_dict,
               **telecom_df_dict}

In [113]:
# Save this dictionary to a CSV file as a backup (each DataFrame is written as its string representation).
with open('./data/spx_dict.csv', 'w', newline="") as csv_file:
    writer = csv.writer(csv_file)
    for key, value in spx_df_dict.items():
        writer.writerow([key, value])
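
If the goal is to reload the DataFrames later, a binary format may be a safer backup than printed text. The block below is a minimal sketch, assuming every value in the dictionary is a pandas DataFrame and using the standard-library `pickle` module. The toy dictionary and the temporary file path are hypothetical stand-ins for `spx_df_dict` and a file under `./data/`.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-in for spx_df_dict: ticker -> DataFrame (hypothetical data).
toy_df_dict = {
    'AAA': pd.DataFrame({'Ticker': ['AAA'], '(BS) Total Cash': [1.0]}),
    'BBB': pd.DataFrame({'Ticker': ['BBB'], '(BS) Total Cash': [2.0]}),
}

# Hypothetical path; the notebook would write under ./data/ instead.
pickle_path = os.path.join(tempfile.mkdtemp(), 'spx_dict.pkl')

# Serialize the whole dictionary, DataFrames included.
with open(pickle_path, 'wb') as f:
    pickle.dump(toy_df_dict, f)

# Load it back and confirm every frame survived the round trip unchanged.
with open(pickle_path, 'rb') as f:
    restored = pickle.load(f)

round_trip_ok = all(restored[k].equals(v) for k, v in toy_df_dict.items())
```

Pickle files should only be loaded from sources you trust, so this suits a local backup rather than a file shared with others.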

### Combine Financial Statements Into One DataFrame

Because accounting standards allow some flexibility, companies report slightly different line items in their financial statements. Concatenating the DataFrames directly would be messy.

Instead, we look under the hood and keep only the key columns we need. The columns used for modeling must be common across companies. To check this, we created another function that quickly tests whether a particular column is widely reported by the companies in the S&P 500. We will use it more in the EDA section that follows.

We will save the combined DataFrame as a CSV file for future use.
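
The coverage check described above could look roughly like the sketch below. This is not the notebook's own function: `column_coverage` is a hypothetical name, and the sketch assumes a dictionary that maps each ticker to its DataFrame, as `spx_df_dict` does. It counts a column as reported when it exists and holds at least one non-null value.

```python
import pandas as pd


def column_coverage(df_dict, column):
    """Return the share of companies that report `column` with at least one non-null value."""
    if not df_dict:
        return 0.0
    reported = sum(
        1
        for df in df_dict.values()
        if column in df.columns and df[column].notna().any()
    )
    return reported / len(df_dict)


# Toy data (hypothetical): BBB lacks the inventory column, CCC has it but only as NaN.
toy = {
    'AAA': pd.DataFrame({'(BS) Inventory': [1.0, 2.0], '(BS) Total Cash': [3.0, 4.0]}),
    'BBB': pd.DataFrame({'(BS) Total Cash': [5.0]}),
    'CCC': pd.DataFrame({'(BS) Inventory': [float('nan')], '(BS) Total Cash': [6.0]}),
}

inventory_share = column_coverage(toy, '(BS) Inventory')  # 1 of 3 companies
cash_share = column_coverage(toy, '(BS) Total Cash')      # 3 of 3 companies
```

A column with a low share is a poor modeling feature, since most rows in the combined DataFrame would be empty for it.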

In [207]:
# Concatenate all company DataFrames in one call instead of growing the frame in a loop.
spx_df = pd.concat(list(spx_df_dict.values()), sort=False)

In [224]:
spx_df

Date,Ticker,Company,(BS) Cash And Cash Equivalents,(BS) Short Term Investments,(BS) Total Cash,(BS) Net Receivables,(BS) Inventory,(BS) Other Current Assets,(BS) Total Current Assets,"(BS) Gross property, plant and equipment",...,(BS) Stockholders' Equity,(IS) Operating Expenses,(IS) Reported EPS,(IS) Weighted average shares outstanding,(IS) Basic 1,(IS) Diluted 1,(CF) Cash flows from operating activities,(CF) Cash flows from investing activities,(CF) Cash flows from financing activities,(CF) Free Cash Flow 1
2015-12-30,ADM,Archer-Daniels-Midland Company,910000.0,438000.0,1348000.0,2886000.0,8243000.0,687000.0,21829000.0,23274000.0,...,,,,,,,,,,
2016-12-30,ADM,Archer-Daniels-Midland Company,619000.0,296000.0,915000.0,2426000.0,8831000.0,451000.0,21045000.0,23497000.0,...,,,,,,,,,,
2017-12-30,ADM,Archer-Daniels-Midland Company,804000.0,0.0,804000.0,2424000.0,9173000.0,372000.0,19925000.0,24793000.0,...,,,,,,,,,,
2018-12-30,ADM,Archer-Daniels-Midland Company,1997000.0,6000.0,2003000.0,2683000.0,8813000.0,223000.0,20588000.0,25102000.0,...,,,,,,,,,,
2016-07-30,CPB,Campbell Soup Company,296000.0,,296000.0,554000.0,940000.0,41000.0,1908000.0,5764000.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2018-12-30,CTL,"CenturyLink, Inc.",488000.0,,488000.0,2094000.0,120000.0,108000.0,3820000.0,53267000.0,...,,,,,,,,,,
2015-12-30,VZ,Verizon Communications Inc.,4470000.0,350000.0,4820000.0,13457000.0,1252000.0,792000.0,22280000.0,220163000.0,...,,,,,,,,,,
2016-12-30,VZ,Verizon Communications Inc.,2880000.0,0.0,2880000.0,17513000.0,1202000.0,882000.0,26395000.0,232215000.0,...,,,,,,,,,,
2017-12-30,VZ,Verizon Communications Inc.,2079000.0,,2079000.0,23493000.0,1034000.0,0.0,29913000.0,246498000.0,...,,,,,,,,,,


In [216]:
spx_df.to_csv('./data/spx_df.csv', index=True)

## Data Collection: Scrape Market Statistics At Scale

As above, we collect market statistics for all 474 companies. This run went much more smoothly. We save the result as a CSV file.

In [155]:
spx_stats = pd.DataFrame()
error_list = []

for company in spx_list:
    try:
        spx_stats = pd.concat([spx_stats, get_stats(company)], sort=True)
    except Exception as e:
        print(type(e), company)
        error_list.append(company)
    sleep(np.random.choice([2,3,4,5]))
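
The loop above records failures in `error_list` but does not retry them. A minimal sketch of retrying in rounds, in the spirit of the second and third rounds shown in the sector logs, follows. `scrape_with_retries` and `fake_fetch` are hypothetical names. In the notebook, `fetch` would be something like `get_stats`, and the default pause choices mirror the random sleep used above.

```python
import random
from time import sleep


def scrape_with_retries(tickers, fetch, max_rounds=3, pause_choices=(2, 3, 4, 5)):
    """Call fetch(ticker) for each ticker, retrying failures for up to max_rounds rounds."""
    results = {}
    pending = list(tickers)
    for _ in range(max_rounds):
        failed = []
        for ticker in pending:
            try:
                results[ticker] = fetch(ticker)
            except Exception:
                failed.append(ticker)
            # Random pause between requests to avoid hammering the server.
            sleep(random.choice(pause_choices))
        pending = failed
        if not pending:
            break
    return results, pending


# Fake fetcher for demonstration: 'BBB' fails once, 'CCC' always fails.
attempts = {}


def fake_fetch(ticker):
    attempts[ticker] = attempts.get(ticker, 0) + 1
    if ticker == 'CCC' or (ticker == 'BBB' and attempts[ticker] == 1):
        raise AssertionError(ticker)
    return {'Ticker': ticker}


# pause_choices=(0,) keeps the demonstration instant.
results, still_failed = scrape_with_retries(
    ['AAA', 'BBB', 'CCC'], fake_fetch, pause_choices=(0,)
)
```

Tickers that keep failing after every round, such as delisted symbols, end up in the second return value, so they can be inspected separately.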

In [160]:
# Reset the index and discard the old one
spx_stats = spx_stats.reset_index(drop=True)

In [167]:
col_order = list(spx_stats.columns)
col_order.remove('Ticker')
col_order.insert(0, 'Ticker')

In [170]:
spx_stats = spx_stats[col_order]

In [199]:
spx_stats = spx_stats.merge(spx, how='left', left_on='Ticker', right_on='Symbol')
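
A left merge keeps every row of `spx_stats`, so tickers with no match in `spx` silently get empty `Symbol`, `Name`, and `Sector` values. The sketch below shows one way to surface them with the `validate` and `indicator` options of `DataFrame.merge`. The toy frames are hypothetical, and it assumes `spx` has one row per `Symbol`.

```python
import pandas as pd

# Toy stand-ins for spx_stats and spx (hypothetical data).
stats = pd.DataFrame({
    'Ticker': ['AAA', 'BBB', 'ZZZ'],
    'Trailing P/E': [20.0, 15.0, 30.0],
})
constituents = pd.DataFrame({
    'Symbol': ['AAA', 'BBB'],
    'Name': ['Alpha Co', 'Beta Co'],
    'Sector': ['Materials', 'Energy'],
})

merged = stats.merge(
    constituents,
    how='left',
    left_on='Ticker',
    right_on='Symbol',
    validate='many_to_one',  # raises if a Symbol appears more than once on the right
    indicator=True,          # adds a '_merge' column describing each row's match status
)

# Tickers that found no partner in the constituents table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Ticker'].tolist()
```

Dropping the `_merge` column afterwards restores the same shape that the plain merge produces.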

In [223]:
spx_stats

,Ticker,% Held by Insiders,% Held by Institutions,200-Day Moving Average,5 Year Average Dividend Yield,50-Day Moving Average,52 Week High,52 Week Low,52-Week Change,Avg Vol (10 day),...,Total Cash (mrq),Total Cash Per Share (mrq),Total Debt (mrq),Total Debt/Equity (mrq),Trailing Annual Dividend Rate,Trailing Annual Dividend Yield,Trailing P/E,Symbol,Name,Sector
0,ADM,0.0032,0.8037,40.39,2.83,41.98,47.16,36.45,-0.0835,1860000.0,...,9.580000e+08,1.72,9.760000e+09,51.52,1.38,0.0323,20.61,ADM,Archer-Daniels-Midland Co,Consumer Staples
1,CPB,0.4438,0.5386,43.61,2.85,46.97,48.39,32.03,0.2015,1810000.0,...,3.100000e+07,0.10,8.470000e+09,762.05,1.40,0.0301,67.44,CPB,Campbell Soup,Consumer Staples
2,CHD,0.0023,0.8844,74.39,1.46,70.63,80.99,59.64,0.0577,1820000.0,...,1.147000e+08,0.47,2.350000e+09,91.63,0.90,0.0128,28.61,CHD,Church & Dwight,Consumer Staples
3,KO,0.0075,0.6899,52.82,3.18,53.33,55.92,44.42,0.0747,10020000.0,...,1.299000e+10,3.03,4.380000e+10,211.79,1.59,0.0298,29.68,KO,Coca-Cola Company (The),Consumer Staples
4,CL,0.0037,0.7851,71.10,2.29,67.36,76.41,57.51,0.0637,2610000.0,...,1.060000e+09,1.23,8.800000e+09,5059.77,1.70,0.0251,25.03,CL,Colgate-Palmolive,Consumer Staples
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
469,VMC,0.0023,0.9878,139.69,0.69,142.45,152.49,90.04,0.3376,907160.0,...,9.041000e+07,0.68,3.200000e+09,57.82,1.21,0.0085,31.44,VMC,Vulcan Materials,Materials
470,WRK,0.0136,0.9014,36.32,3.12,38.68,48.55,31.94,-0.1468,3310000.0,...,1.516000e+08,0.59,1.006000e+10,86.11,1.82,0.0451,12.13,WRK,WestRock Company,Materials
471,T,0.0007,0.5544,35.36,5.49,38.28,39.70,26.80,0.1788,24900000.0,...,6.690000e+09,0.92,1.974900e+11,101.63,2.04,0.0546,16.73,T,AT&T Inc,Telecommunication Services
472,CTL,0.0078,0.7520,12.06,9.86,13.75,19.05,9.64,-0.2288,10460000.0,...,1.400000e+09,1.29,3.689000e+10,269.82,1.29,0.0890,,CTL,CenturyLink Inc,Telecommunication Services


In [200]:
spx_stats.to_csv('./data/spx_stats.csv')