
In [1]:
from EDGARConnect import EDGARConnect

# Getting Started

Instantiate an EDGARConnect object and give it the path where all output should be written. You can also pass a dictionary of headers or a dictionary of retry arguments, which are handed to the Requests session. By default, EDGARConnect uses a fake user-agent (the <a href="">fake-useragent</a> package is required) and some reasonable header values.

The default retry behavior is exponential back-off with 8 retries and a base of 2. See the docstring for more details.
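To get a feel for what 8 retries with a base of 2 means, here is a minimal sketch of an exponential back-off schedule. The helper name `backoff_delays` and the exact formula (`base ** attempt` seconds) are assumptions for illustration; the formula EDGARConnect actually uses is described in its docstring and may differ.

```python
def backoff_delays(retries=8, base=2):
    """Hypothetical schedule: wait base**0, base**1, ... seconds before each retry."""
    return [base ** attempt for attempt in range(retries)]


delays = backoff_delays()
print(delays)       # wait before each of the 8 retries, in seconds
print(sum(delays))  # worst-case total wait if every retry is needed
```

Under these assumptions the waits double each time, so most of the total wait comes from the last couple of retries.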

In [2]:
edgar = EDGARConnect(edgar_path="")

Print the object to check the configuration status.

In [3]:
print(edgar)

SEC Edgar Scraper for Python, v0.0
Files to be scraped have NOT been defined.
Choose scraping targets using the configure_downloader() method


## Configuration

Call the configure_downloader() method to tell EDGARConnect which forms and date ranges you are interested in. Passing end_date = None means "up to the present day".

In [4]:
edgar.configure_downloader(target_forms="10-K", start_date="2021-07-01", end_date=None)

In [5]:
print(edgar)

SEC Edgar Scraper for Python, v0.0
EDGARConnect is configured for scraping.
	 Target Forms: ['10-K']
	 Date Range: 2021Q3 to 2021Q3
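The date range is reported in quarters because the SEC indexes are quarterly. As a rough sketch of how a date such as "2021-07-01" maps to the label "2021Q3", consider the helper below. The function name `quarter_label` is hypothetical and is not part of EDGARConnect.

```python
from datetime import date


def quarter_label(iso_date):
    """Map an ISO date string to a 'YYYYQn' label (illustrative helper)."""
    d = date.fromisoformat(iso_date)
    quarter = (d.month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
    return f"{d.year}Q{quarter}"


print(quarter_label("2021-07-01"))
```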



You can also ask for multiple form types by passing in a list.

In [6]:
edgar.configure_downloader(
    target_forms=["10-K", "10-Q"], start_date="2021-07-01", end_date=None
)
print(edgar)

SEC Edgar Scraper for Python, v0.0
EDGARConnect is configured for scraping.
	 Target Forms: ['10-K', '10-Q']
	 Date Range: 2021Q3 to 2021Q3



Finally, for convenience, the EDGARConnect instance has a built-in dictionary of closely related forms. These lists were taken from Bill McDonald and Tim Loughran's EDGAR download script. The keys of the built-in dictionary can be displayed using the <code>show_available_forms()</code> method.

In [7]:
edgar.show_available_forms()

Available forms:
f_10k -> ['10-K', '10-K405', '10KSB', '10-KSB', '10KSB40']
f_10ka -> ['10-K/A', '10-K405/A', '10KSB/A', '10-KSB/A', '10KSB40/A']
f_10kt -> ['10-KT', '10KT405', '10-KT/A', '10KT405/A']
f_10q -> ['10-Q', '10QSB', '10-QSB']
f_10qa -> ['10-Q/A', '10QSB/A', '10-QSB/A']
f_10qt -> ['10-QT', '10-QT/A']
f_10x -> ['10-K', '10-K405', '10KSB', '10-KSB', '10KSB40', '10-K/A', '10-K405/A', '10KSB/A', '10-KSB/A', '10KSB40/A', '10-KT', '10KT405', '10-KT/A', '10KT405/A', '10-Q', '10QSB', '10-QSB', '10-Q/A', '10QSB/A', '10-QSB/A', '10-QT', '10-QT/A']
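Judging from the printed output, the `f_10x` entry is the concatenation of the six narrower groups. The sketch below rebuilds it from a standalone copy of the lists shown above, which is also how you could assemble your own custom group. The `forms` dictionary here is a hand-copied stand-in, not the object held by the EDGARConnect instance.

```python
# Stand-in copy of the groups printed by show_available_forms()
forms = {
    "f_10k": ["10-K", "10-K405", "10KSB", "10-KSB", "10KSB40"],
    "f_10ka": ["10-K/A", "10-K405/A", "10KSB/A", "10-KSB/A", "10KSB40/A"],
    "f_10kt": ["10-KT", "10KT405", "10-KT/A", "10KT405/A"],
    "f_10q": ["10-Q", "10QSB", "10-QSB"],
    "f_10qa": ["10-Q/A", "10QSB/A", "10-QSB/A"],
    "f_10qt": ["10-QT", "10-QT/A"],
}

# Concatenate the groups in the order they are listed
combined = []
for key in ("f_10k", "f_10ka", "f_10kt", "f_10q", "f_10qa", "f_10qt"):
    combined.extend(forms[key])

# A custom group: annual and quarterly reports without amendments
annual_and_quarterly = forms["f_10k"] + forms["f_10q"]

print(len(combined))
print(annual_and_quarterly)
```

A plain list like `annual_and_quarterly` can be passed as `target_forms` in the same way as the literal list in the earlier cell.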


Each list can be accessed by key through the <code>forms</code> attribute:

In [8]:
edgar.configure_downloader(
    target_forms=edgar.forms["f_10k"], start_date="2021-07-01", end_date=None
)
print(edgar)

SEC Edgar Scraper for Python, v0.0
EDGARConnect is configured for scraping.
	 Target Forms: ['10-K', '10-K405', '10KSB', '10-KSB', '10KSB40']
	 Date Range: 2021Q3 to 2021Q3



## Download the Master Indexes

EDGARConnect first downloads the SEC master indexes to your local drive. To do this, use the download_master_indexes() method. These files are quarterly, pipe-delimited tables of URLs to corporate filings. By default, EDGARConnect updates the 2 most recent quarters every time you run download_master_indexes(), but you can change this behavior by passing parameters.

In [9]:
edgar.download_master_indexes(update_range=0, update_all=False)
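To illustrate what a pipe-delimited master index looks like and how it can be filtered by form type, here is a small parsing sketch. The sample rows, company names, and file paths are invented for illustration, the column layout is an assumption based on the SEC's published index format, and `parse_master_index` is a hypothetical helper rather than an EDGARConnect method.

```python
# Invented sample in the assumed layout: header row, a dashed rule, then data rows
SAMPLE_INDEX = """\
CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000001|EXAMPLE CORP|10-K|2021-07-15|edgar/data/1000001/0001000001-21-000001.txt
1000002|SAMPLE HOLDINGS INC|10-Q|2021-08-02|edgar/data/1000002/0001000002-21-000002.txt
1000003|PLACEHOLDER LTD|10-K|2021-09-30|edgar/data/1000003/0001000003-21-000003.txt
"""


def parse_master_index(text):
    """Return one dict per data row, keyed by the header column names."""
    lines = text.splitlines()
    # Skip any preamble by locating the header row
    start = next(i for i, line in enumerate(lines) if line.startswith("CIK|"))
    header = lines[start].split("|")
    rows = []
    for line in lines[start + 1:]:
        parts = line.split("|")
        if len(parts) == len(header):  # ignores the dashed rule and blank lines
            rows.append(dict(zip(header, parts)))
    return rows


rows = parse_master_index(SAMPLE_INDEX)
annual_reports = [row for row in rows if row["Form Type"] == "10-K"]
print(len(rows), len(annual_reports))
```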




## Check Download Plan

After the master indexes are downloaded, EDGARConnect can download everything you request from the SEC archive. You can preview the download plan using the show_download_plan() method. This is worth doing because the number of filings can be surprisingly large, and it's nice to know what you're signing up for.

In [10]:
edgar.show_download_plan()

EDGARConnect is prepared to download 5 types of filings between 2021Q3 and 2021Q3
	Number of 10-Ks: 64
	Number of 10-K405s: 0
	Number of 10KSBs: 0
	Number of 10-KSBs: 0
	Number of 10KSB40s: 0
	Total files: 64
Estimated download time, assuming 1s per file: 0 Days, 0 hours, 1 minutes, 4 seconds
Estimated drive space, assuming 150KB per filing: 0.01GB
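The two estimates in the plan are simple arithmetic on the file count. The sketch below reproduces them under the stated assumptions of 1 second and 150 KB per filing. The function name `estimate_plan` and the choice of 1,000,000 KB per GB are assumptions made for illustration.

```python
def estimate_plan(n_files, seconds_per_file=1, kb_per_file=150):
    """Return ((days, hours, minutes, seconds), gigabytes) for a download plan."""
    total_seconds = n_files * seconds_per_file
    days, remainder = divmod(total_seconds, 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, seconds = divmod(remainder, 60)
    gigabytes = n_files * kb_per_file / 1e6  # assumes 1 GB = 1,000,000 KB
    return (days, hours, minutes, seconds), gigabytes


duration, size_gb = estimate_plan(64)
print(duration, round(size_gb, 2))
```

For 64 files this gives 1 minute and 4 seconds and roughly 0.01 GB, matching the plan above. For 100,000 files the same arithmetic gives a bit over a day and about 15 GB, which is why checking the plan first is useful.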


# Downloading

When you're ready to go, use the download_requested_filings() method to start downloading. It always checks whether a file already exists locally and skips it if it does, so an interrupted run can be restarted without re-downloading anything.

In [None]:
edgar.download_requested_filings(ignore_time_guidelines=True, remove_attachments=False)

Gathering URLS for the requested forms...
Beginning scraping from 2021Q3
2021Q3 10-K       Found 43 / 64 locally, requesting the remaining 21...
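The "found locally, requesting the remaining" behavior can be pictured as a simple existence check before each request. This is a minimal sketch of that idea, not EDGARConnect's actual implementation; `missing_filings` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def missing_filings(filenames, out_dir):
    """Return the file names that do not yet exist under out_dir."""
    out_dir = Path(out_dir)
    return [name for name in filenames if not (out_dir / name).exists()]


# Demo: one filing is already on disk, one is not
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "filing_a.txt").write_text("already downloaded")
    to_request = missing_filings(["filing_a.txt", "filing_b.txt"], tmp)

print(to_request)
```

Because the check is made against the file system rather than an in-memory record, it gives the same answer after a crash or a manual stop.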

There are two arguments that can be passed to the <code>download_requested_filings()</code> method. 

The first is <code>ignore_time_guidelines</code>. The SEC requests that users bulk download only between 9PM and 6AM EST. By default, EDGARConnect checks whether the current time falls inside this window and raises an error if it does not. It repeats this check periodically while downloads are running (every time a new batch of forms is selected for download).

To disable this behavior, pass <code>ignore_time_guidelines = True</code>. If your downloads then slow to a crawl, the likely cause is that the SEC has identified you as a mass-downloader and throttled you.
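The time-window check amounts to testing whether the current Eastern-time hour is at or after 9PM or before 6AM. Here is a minimal sketch; the function name `is_off_peak` is hypothetical, and it assumes the datetime passed in has already been converted to Eastern time.

```python
from datetime import datetime


def is_off_peak(eastern_time, start_hour=21, end_hour=6):
    """True if the (already Eastern) time falls in the 9PM to 6AM window."""
    hour = eastern_time.hour
    # The window wraps past midnight, so either condition is enough
    return hour >= start_hour or hour < end_hour


print(is_off_peak(datetime(2021, 7, 1, 22, 30)))
print(is_off_peak(datetime(2021, 7, 1, 12, 0)))
```

The `or` is the important detail: a window that crosses midnight cannot be expressed as a single `start <= hour < end` comparison.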

The second is <code>remove_attachments</code>. Despite being .txt files, some filings include large images, spreadsheets, or PDFs, causing the filings to be quite large (the largest I found was 250 MB). If you don't explicitly need these attachments, I recommend passing <code>remove_attachments = True</code>. This will pass all downloaded filings into a function that tries to strip out as many of these attachments as possible, saving considerable disk space when downloading large numbers of filings.
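As a rough idea of what stripping attachments can involve, the sketch below drops embedded documents whose declared type marks them as binary. It assumes filings wrap each document in `<DOCUMENT>` ... `</DOCUMENT>` tags with a `<TYPE>` line, and the set of type names treated as attachments is a guess. `strip_attachments` is a hypothetical helper, and the function EDGARConnect actually uses may work differently.

```python
import re

DOCUMENT_RE = re.compile(r"<DOCUMENT>.*?</DOCUMENT>\n?", re.DOTALL)
TYPE_RE = re.compile(r"<TYPE>(\S+)")
BINARY_TYPES = {"GRAPHIC", "ZIP", "EXCEL", "PDF"}  # assumed attachment types


def strip_attachments(filing_text):
    """Remove <DOCUMENT> blocks whose <TYPE> is in BINARY_TYPES."""
    def keep_or_drop(match):
        block = match.group(0)
        type_match = TYPE_RE.search(block)
        if type_match and type_match.group(1).upper() in BINARY_TYPES:
            return ""  # drop the whole embedded attachment
        return block   # keep text documents untouched

    return DOCUMENT_RE.sub(keep_or_drop, filing_text)


# Invented two-document filing: one text report, one encoded image
SAMPLE_FILING = """\
<DOCUMENT>
<TYPE>10-K
<TEXT>
Annual report body
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>GRAPHIC
<TEXT>
begin 644 logo.jpg
M_]C_X
end
</TEXT>
</DOCUMENT>
"""

cleaned = strip_attachments(SAMPLE_FILING)
print(cleaned)
```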