# Download Text Files

In this notebook, we download the most recent 10-K filed before the end of 2022 for each firm in the S&P 500. The list of S&P 500 tickers and CIKs is taken from Wikipedia. All 10-Ks are first downloaded from SEC EDGAR with `sec_edgar_downloader`, using each company's ticker. Filings that are missing or that belong to the wrong company are then redownloaded using the CIK, which is a more robust identifier.

## Setup



In [1]:
import glob
import os
import warnings
import re
import pandas as pd
import numpy as np

from requests_html import HTMLSession
from sec_edgar_downloader import Downloader
from tqdm import tqdm
from bs4 import BeautifulSoup

warnings.filterwarnings(
    "ignore",
    message="It looks like you're parsing an XML document using an HTML parser",
)

## List Tickers

Below, we gather the list of each S&P500 firm from Wikipedia.  If the list already exists in the inputs folder, we skip the download.

In [2]:
sp500_path = "inputs/s&p500_2022.csv"

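if not os.path.exists(sp500_path):
    os.makedirs(os.path.dirname(sp500_path), exist_ok=True)  # make sure inputs/ exists
    url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
    # Table [0] lists the current constituents; table [1] lists changes to the index
    pd.read_html(url)[0].to_csv(sp500_path, index=False)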

sp500 = pd.read_csv(sp500_path)
sp500

Unnamed: 0,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded,truth_path
0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,66740,1902,MMM
1,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,91142,1916,AOS
2,ABT,Abbott,Health Care,Health Care Equipment,"North Chicago, Illinois",1957-03-04,1800,1888,ABT
3,ABBV,AbbVie,Health Care,Pharmaceuticals,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888),ABBV
4,ACN,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989,ACN
...,...,...,...,...,...,...,...,...,...
498,YUM,Yum! Brands,Consumer Discretionary,Restaurants,"Louisville, Kentucky",1997-10-06,1041061,1997,YUM
499,ZBRA,Zebra Technologies,Information Technology,Electronic Equipment & Instruments,"Lincolnshire, Illinois",2019-12-23,877212,1969,ZBRA
500,ZBH,Zimmer Biomet,Health Care,Health Care Equipment,"Warsaw, Indiana",2001-08-07,1136869,1927,ZBH
501,ZION,Zions Bancorporation,Financials,Regional Banks,"Salt Lake City, Utah",2001-06-22,109380,1873,0000109380


## Download 10-Ks

In this step, we download each firm's most recent 10-K filed before 2023. We start by setting the variables that stay constant across iterations of the loop: the cutoff filing date, the number of filings to download per firm, and the filing type (10-K). We also make sure the output directory exists and is usable.

If files are already downloaded, we skip the download.

In [3]:
# Set up downloader variables
tics = sp500['Symbol'].to_list()
before = '2023-01-01'
res_path = '10K_files'
download_type = '10-K'
amount = 1

# Ensure the directory and downloader exist
os.makedirs(res_path, exist_ok=True)
dl = Downloader(res_path)

In [4]:
# Loop over each ticker
for tic in tqdm(tics):
    # Check if the files were already downloaded
    tic_res_path = fr'{res_path}/sec-edgar-filings/{tic}/{download_type}'
    file_downloaded = (
        os.path.exists(tic_res_path) and len(os.listdir(tic_res_path)) >= amount
    )  # quick check
    if not file_downloaded:
        try:
            dl.get(download_type, tic, before=before, amount=amount)
        except Exception as error:
            print(f'Error on {tic}: {repr(error)}')

    # Delete any .txt files in the filing folders, keeping only the HTML filings
    for file in glob.glob(tic_res_path + '/*/*.txt'):
        os.remove(file)

100%|███████████████████████████████████████████████████████████████████████████████| 503/503 [00:02<00:00, 187.25it/s]
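
The "quick check" used in the loop above can be isolated into a small helper. The sketch below is only an illustration: the function name `already_downloaded` and the accession-style folder name are hypothetical, and it assumes each downloaded filing sits in its own subfolder under the ticker's `10-K` directory, as the paths above suggest.

```python
import os
import tempfile


def already_downloaded(filing_dir, amount=1):
    """Return True if filing_dir exists and holds at least `amount` entries."""
    return os.path.isdir(filing_dir) and len(os.listdir(filing_dir)) >= amount


# Demonstrate on a throwaway directory tree that mimics the layout used above.
with tempfile.TemporaryDirectory() as root:
    filing_dir = os.path.join(root, "sec-edgar-filings", "MMM", "10-K")
    before_download = already_downloaded(filing_dir)   # directory does not exist yet
    # Hypothetical accession-number folder standing in for one downloaded filing
    os.makedirs(os.path.join(filing_dir, "0000000000-22-000001"))
    after_download = already_downloaded(filing_dir)    # one filing folder present
```

Note that this only counts folders. It does not confirm that the filing inside belongs to the right company, which is why the content check in the next section is still needed.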


## Check for Missing and Erroneous Filings

Checking ticker A, the first ticker alphabetically, we find that `sec_edgar_downloader` downloaded the filing for HEI.A (HEICO Corporation) instead of A (Agilent Technologies). In the loop below, we flag every ticker whose company name cannot be found in its downloaded 10-K. We also record any tickers that are missing from the folder structure.

Note that only 496 of the 503 expected folders are present at the beginning of this loop. The missing firms should be downloaded using their CIK, which stays constant, because their tickers are not available (or they are duplicates).

In [5]:
# Replace misspellings in sp500 table
try:
    sp500.loc[sp500['Symbol'] == 'GWW', ['Security']] = 'W.W. Grainger'
    sp500.loc[sp500['Symbol'] == 'COO', ['Security']] = 'The Cooper Companies, Inc.'
    sp500.loc[sp500['Symbol'] == 'HSY', ['Security']] = 'The Hershey Company'
    sp500.loc[sp500['Symbol'] == 'MCO', ['Security']] = "Moody's Corporation"
    sp500.loc[sp500['Symbol'] == 'ORLY', ['Security']] = "O'Reilly Automotive, Inc."
    sp500.loc[sp500['Symbol'] == 'PH', ['Security']] = 'Parker-Hannifin Corporation'
except Exception as error:
    print('Warning: could not replace expected symbol names')
    print(repr(error))

In [6]:
# Create sets to store the names of companies
missing_path_tics = set()
missing_content_tics = set()

# Loop over each ticker
for tic in tics:
    tic_res_path = fr'{res_path}/sec-edgar-filings/{tic}/{download_type}'
    
    # Check that a file exists at its expected file path
    if not os.path.exists(tic_res_path):
        missing_path_tics.add(tic)
        continue
    
    # Find any downloaded 10-Ks
    for file in glob.glob(tic_res_path+'/*/*.html'):
        with open(file, 'rb') as report_file:
            html = report_file.read()
        
        # Get the approximate firm name
        tic_name = sp500.loc[sp500['Symbol'] == tic]['Security'].values[0]
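        # Raw strings avoid invalid-escape warnings for \s, \( and \.
        appx_tic = re.sub(r'\s*&\s*', '.+', tic_name)          # '&' may be written differently
        appx_tic = re.sub(r'\(.*\)', '', appx_tic)             # drop parenthetical notes
        appx_tic = re.sub(r"[^\x00-\x7F]|'", '.', appx_tic)    # non-ASCII characters and apostrophes
        appx_tic = re.sub(r'(,\s*)?(Inc|Co)\.', '', appx_tic).strip()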
        
        # Search for the firm name in the file
        if not re.search(appx_tic.encode('utf-8'), html, re.IGNORECASE):
            missing_content_tics.add(tic)

print('Cannot find path for ', len(missing_path_tics), ' tickers:\n', missing_path_tics)
print('\nCannot find company name in 10-Ks of ', len(missing_content_tics), ' tickers:\n', missing_content_tics)

Cannot find path for  7  tickers:
 {'BF.B', 'WBD', 'BRK.B', 'GEHC', 'ELV', 'FRC', 'SBNY'}

Cannot find company name in 10-Ks of  47  tickers:
 {'WY', 'IP', 'DE', 'ALL', 'ON', 'PM', 'CAT', 'MU', 'C', 'GM', 'KEY', 'IPG', 'F', 'WELL', 'NOW', 'CF', 'CMG', 'MA', 'O', 'MCO', 'RF', 'FAST', 'ICE', 'MET', 'ZION', 'BK', 'FE', 'PEAK', 'MS', 'J', 'BA', 'GEN', 'CB', 'NI', 'BIO', 'IT', 'WM', 'ESS', 'IR', 'META', 'D', 'L', 'A', 'SWK', 'HST', 'DRI', 'TECH'}
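
The name-matching heuristic above can be tried on its own to see what it accepts and rejects. The following is a minimal sketch: the helper names `name_pattern` and `name_in_filing` and the sample byte strings are made up for illustration, and the regular expressions mirror the ones in the loop above.

```python
import re


def name_pattern(security_name):
    """Build a loose regex for a company name (hypothetical helper)."""
    pattern = re.sub(r"\s*&\s*", ".+", security_name)      # '&' may be spelled out or escaped
    pattern = re.sub(r"\(.*\)", "", pattern)               # drop parenthetical notes
    pattern = re.sub(r"[^\x00-\x7F]|'", ".", pattern)      # non-ASCII characters and apostrophes
    pattern = re.sub(r"(,\s*)?(Inc|Co)\.", "", pattern)    # drop 'Inc.' / 'Co.' suffixes
    return pattern.strip()


def name_in_filing(security_name, html_bytes):
    """Return True if the loose pattern matches anywhere in the raw filing bytes."""
    pattern = name_pattern(security_name).encode("utf-8")
    return re.search(pattern, html_bytes, re.IGNORECASE) is not None


# Made-up snippets standing in for the contents of a downloaded filing
sample = b"<title>MOODY'S CORPORATION - 10-K</title>"
entity_sample = b"<title>MOODY&#8217;S CORPORATION - 10-K</title>"

found = name_in_filing("Moody's Corporation", sample)
not_found = name_in_filing("Agilent Technologies", sample)
missed_entity = name_in_filing("Moody's Corporation", entity_sample)
```

The last case shows one limitation of the heuristic: a single `.` matches exactly one byte, so an apostrophe written as an HTML entity is not matched. A ticker flagged by this check is therefore a candidate for redownload, not proof that the filing is wrong.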


## Replace Missing and Erroneous Filings

Now, we use the CIK to download 10-Ks for the firms whose filings were incorrect or missing after the initial download.

Then, we add a column to the data indicating which path is the correct path for each ticker.  Alternatively, we could move these files to the expected ticker path.

In [7]:
# Join all failed 10-Ks
missing_all_tics = set()
missing_all_tics.update(missing_path_tics)
missing_all_tics.update(missing_content_tics)

# Download via CIK
for tic in tqdm(missing_all_tics):
    # Look up the CIK in the sp500 df
    cik = sp500.loc[sp500['Symbol'] == tic, 'CIK'].item()
    cik_path = '{0:0>10}'.format(cik)    # All CIKs are 10 characters long on download
    
    # Download based on the CIK from EDGAR
    cik_res_path = fr'{res_path}/sec-edgar-filings/{cik_path}/{download_type}'
    file_downloaded = (
        os.path.exists(cik_res_path) and len(os.listdir(cik_res_path)) >= amount
    )  # quick check
    if not file_downloaded:
        try:
            dl.get(download_type, cik, before=before, amount=amount)
        except Exception as error:
            print(f'Error on {tic}: {repr(error)}')

    # Delete any .txt files in the filing folders, keeping only the HTML filings
    for file in glob.glob(cik_res_path + '/*/*.txt'):
        os.remove(file)

100%|██████████████████████████████████████████████████████████████████████████████████| 54/54 [00:00<00:00, 54.32it/s]
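
The folder name used for CIK-based downloads is the CIK left-padded with zeros to 10 characters. As a small sketch (the helper name `cik_folder` is hypothetical), the same result can be produced with `str.zfill`:

```python
def cik_folder(cik):
    """Zero-pad a CIK to the 10-character form used in the folder names above."""
    return str(int(cik)).zfill(10)


zion_folder = cik_folder(109380)              # CIK for ZION from the table above
same_as_format = "{0:0>10}".format(109380)    # the formatting used in the loop above
```

Both expressions give `0000109380`, which matches the `truth_path` value shown for ZION in the table.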


In [8]:
# Add a new column to the sp500 table indicating which folder holds the correct filing
sp500['truth_path'] = np.where(
        sp500['Symbol'].isin(missing_all_tics),
        sp500['CIK'].apply(lambda x: '{0:0>10}'.format(x)),    # all CIK paths prepended with 0s
        sp500['Symbol'])

# Persist updates
sp500.to_csv(sp500_path, index=False)
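
The alternative mentioned earlier, moving the CIK downloads into the expected ticker folders, might look like the sketch below. This is an assumption-laden illustration and not what the notebook does: the helper name `move_cik_folder` is hypothetical, and it deletes any existing ticker folder first, so it should only be run on tickers already known to hold a wrong filing.

```python
import os
import shutil
import tempfile


def move_cik_folder(base_dir, cik_folder, ticker):
    """Move base_dir/<cik_folder> to base_dir/<ticker>, replacing any existing ticker folder."""
    src = os.path.join(base_dir, cik_folder)
    dst = os.path.join(base_dir, ticker)
    if not os.path.isdir(src):
        return False                  # nothing was downloaded under the CIK
    if os.path.isdir(dst):
        shutil.rmtree(dst)            # discard the mistaken ticker download
    shutil.move(src, dst)
    return True


# Demonstrate on a throwaway directory tree that mimics the layout used above.
with tempfile.TemporaryDirectory() as root:
    base = os.path.join(root, "sec-edgar-filings")
    os.makedirs(os.path.join(base, "0000109380", "10-K"))   # CIK-based download
    os.makedirs(os.path.join(base, "ZION", "10-K"))         # earlier ticker-based download
    moved = move_cik_folder(base, "0000109380", "ZION")
    remaining = sorted(os.listdir(base))
```

Keeping a `truth_path` column, as done above, avoids this destructive step and preserves the original ticker downloads for inspection.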