Rate limit edgar requests #24

Closed · rpocase opened this issue May 19, 2021 · 16 comments · Fixed by #29
rpocase commented May 19, 2021

The SEC website recently (within the last couple of months) added rate limiting. Currently, none of this library's requests respond to it properly. This leads to hard-to-decode errors and makes the library much less usable in a scripted fashion. When an IP is flagged for rate limiting, the SEC website returns a 403 response with a body like the text below.

<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>SEC.gov | Request Rate Threshold Exceeded</title>
<style>
html {height: 100%}
body {height: 100%; margin:0; padding:0;}
#header {background-color:#003968; color:#fff; padding:15px 20px 10px 20px;font-family:Arial, Helvetica, sans-serif; font-size:20px; border-bottom:solid 5px #000;}
#footer {background-color:#003968; color:#fff; padding:15px 20px;font-family:Arial, Helvetica, sans-serif; font-size:20px;}
#content {max-width:650px;margin:60px auto; padding:0 20px 100px 20px; background-image:url(seal_bw.png);background-repeat:no-repeat;background-position:50% 100%;}
h1 {font-family:Georgia, Times, serif; font-size:20px;}
h2 {text-align:center; font-family:Georgia, Times, serif; font-size:20px; width:100%; border-bottom:solid #999 1px;padding-bottom:10px; margin-bottom:20px;}
h3 {font-family:Georgia, Times, serif; font-size:16px; margin:25px 0 0 0;}
p {font-family:Verdana, Geneva, sans-serif;font-size:14px;line-height:1.3;}
.grey_box {background-color:#eee; padding:5px 40px 20px 40px;margin-top:75px;}
.grey_box p {font-size:12px;line-height:1.5}
.note {padding: 0 40px; font-style: italic;}
</style>
</head>

<body>
<div id="header">U.S. Securities and Exchange Commission</div>
<div id="content">
<h1>Your Request Originates from an Undeclared Automated Tool</h1>
<p>To allow for equitable access to all users, SEC reserves the right to limit requests originating from undeclared automated tools. Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic.</p>

<p>Please declare your traffic by updating your user agent to include company specific information.</p>


<p>For best practices on efficiently downloading information from SEC.gov, including the latest EDGAR filings, visit <a href="https://www.sec.gov/developer" target="_blank">sec.gov/developer</a>. You can also <a href="https://public.govdelivery.com/accounts/USSEC/subscriber/new?topic_id=USSEC_260" target="_blank">sign up for email updates</a> on the SEC open data program, including best practices that make it more efficient to download data, and SEC.gov enhancements that may impact scripted downloading processes. For more information, contact <a href="mailto:opendata@sec.gov">opendata@sec.gov</a>.</p>

<p>For more information, please see the SEC's <a href="#internet">Web Site Privacy and Security Policy</a>. Thank you for your interest in the U.S. Securities and Exchange Commission.
</p><p>Reference ID: 0.8fa83817.1621385100.1c3be594</p>
<div class="grey_box">
<h2>More Information</h2>
<h3><a name="internet" id="internet">Internet Security Policy</a></h3>

<p>By using this site, you are agreeing to security monitoring and auditing. For security purposes, and to ensure that the public service remains available to users, this government computer system employs programs to monitor network traffic to identify unauthorized attempts to upload or change information or to otherwise cause damage, including attempts to deny service to users.</p>

<p>Unauthorized attempts to upload information and/or change information on any portion of this site are strictly prohibited and are subject to prosecution under the Computer Fraud and Abuse Act of 1986 and the National Information Infrastructure Protection Act of 1996 (see Title 18 U.S.C. §§ 1001 and 1030).</p>

<p>To ensure our website performs well for all users, the SEC monitors the frequency of requests for SEC.gov content to ensure automated searches do not impact the ability of others to access SEC.gov content. We reserve the right to block IP addresses that submit excessive requests.  Current guidelines limit users to a total of no more than 10 requests per second, regardless of the number of machines used to submit requests. </p>

<p>If a user or application submits more than 10 requests per second, further requests from the IP address(es) may be limited for a brief period. Once the rate of requests has dropped below the threshold for 10 minutes, the user may resume accessing content on SEC.gov. This SEC practice is designed to limit excessive automated searches on SEC.gov and is not intended or expected to impact individuals browsing the SEC.gov website. </p>

<p>Note that this policy may change as the SEC manages SEC.gov to ensure that the website performs efficiently and remains available to all users.</p>
</div>
<br />
<p class="note"><b>Note:</b> We do not offer technical support for developing or debugging scripted downloading processes.</p>
</div>
</body>
</html>
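A minimal sketch of detecting and backing off from this response with `requests` (the status code and page title are taken from the response body above; the backoff schedule is arbitrary):

```python
import time
import requests

def get_with_rate_limit_backoff(url, headers, max_tries=3):
    """GET a URL, retrying with backoff when the SEC's rate-limit 403 page comes back."""
    for attempt in range(max_tries):
        resp = requests.get(url, headers=headers)
        # The throttle response is a 403 whose body carries this title.
        if resp.status_code == 403 and "Request Rate Threshold Exceeded" in resp.text:
            time.sleep(10 * 2 ** attempt)  # arbitrary exponential backoff
            continue
        return resp
    raise RuntimeError(f"Still rate limited after {max_tries} attempts: {url}")
```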
rpocase (Author) commented May 19, 2021

For anyone who hits this, I'm able to work around it pretty seamlessly by using requests-random-user-agent. In my use case the primary issue seems to be the lack of a custom user agent, which triggers automation denial much sooner. As long as I'm not doing an excessive amount of scraping, I can mostly ignore the rate limiting for now. Just importing the library is enough for any subsequent calls to requests to get a random user agent assigned.
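A minimal sketch of that workaround (assuming, per the package's docs, that importing it patches the default User-Agent used by requests; the URL is just an example):

```python
import requests_random_user_agent  # noqa: F401 -- imported for its side effect
import requests

# Subsequent calls now carry a randomized User-Agent header.
resp = requests.get("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany")
print(resp.status_code, resp.request.headers["User-Agent"])
```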

rajah9 commented Jul 3, 2021

I'm seeing this error as well. It's throwing an error on line 18 of Edgar.__init__ when trying to parse line 2 ("") into _name and _cik. I can work around the error (without a code change) by going through NordVPN.

joeyism (Owner) commented Jul 11, 2021

@rpocase Do you think you can submit an MR for it?

rpocase (Author) commented Jul 14, 2021

@joeyism I'd love to, but don't know that I'll find the time to do a proper fix. My workaround has been sufficient for my needs, but I wouldn't recommend introducing it into the base library as the "right" fix.

@tommycarstensen

Just slow it down with time.sleep(10) and it will work fine.

> If a user or application submits more than 10 requests per second, further requests from the IP address(es) may be limited for a brief period.
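For example (a sketch; the URL list and the contact string in the header are placeholders):

```python
import time
import requests

headers = {"User-Agent": "Sample Company admin@example.com"}  # placeholder contact
urls = [
    "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&company=apple&type=10-K",
]

for url in urls:
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    time.sleep(10)  # one request every 10 seconds, far below the 10 req/s threshold
```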

drGarbinsky commented Aug 19, 2021

I think there are two issues blended together here: rate limiting and the user agent string.

I am working on a fix for the user agent string that uses a singleton session object on which you can set default headers. This also has the added benefit of reusing a pooled TCP connection.

As far as rate limiting goes, the SEC should return a 429, and that should be handled with some sort of backoff. Has anyone confirmed that the SEC returns a 429? They are also beta testing RESTful APIs, so this may all be moot.
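A sketch of the singleton-session idea (names are illustrative, not the library's actual API):

```python
import requests

_SESSION = None

def get_session():
    """Return one shared Session: default headers set once, TCP connections pooled."""
    global _SESSION
    if _SESSION is None:
        _SESSION = requests.Session()
        _SESSION.headers.update({
            "User-Agent": "Sample Company Name admin@example.com",  # placeholder
        })
    return _SESSION

# Every caller reuses the same pooled connection and declared headers:
# get_session().get("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany")
```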

rpocase (Author) commented Aug 20, 2021

> Has anyone confirmed that the SEC returns a 429?

Unless anything has changed since issue creation, they respond with a 403.

@mahantymanoj

```python
from time import sleep
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError
from fake_useragent import UserAgent  # the "fake user agent package" mentioned below

ua = UserAgent()

final_url = 'https://www.sec.gov/Archives/edgar/data/0000866273/000086627321000088/0000866273-21-000088-index.htm'
good_read = False
while not good_read:
    sleep(3)
    try:
        user_agent = {'User-agent': ua.random}
        conn = Request(final_url, headers=user_agent)
        response = urlopen(conn)
        try:
            table_response = response.read()  # read the response body
            good_read = True
        finally:
            pass
    except HTTPError as e:
        print("HTTP Error:", e.code, end=" ")
    except URLError as e:
        print("URL Error:", e.reason, end=" ")
    except TimeoutError as e:
        print("Timeout Error:", e, end=" ")
```

I am using the above code to scrape filings from the sec.gov website using a random user agent; for the random user agent I have used the fake-useragent package. Still, I am facing HTTP Error 403. What is the solution to avoid 403 errors? It was working before, but I have been facing Error 403 for the last few days.

lhyleo commented Nov 4, 2021

> (mahantymanoj's code and question quoted from above)

Follow https://www.sec.gov/os/webmaster-faq#developers on how to formulate the User-Agent header. Even a fake email would work; I think they are only doing some regex matching.
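Concretely, headers in the shape the FAQ asks for (values here are placeholders; check the FAQ for its exact recommendation):

```python
headers = {
    "User-Agent": "Sample Company Name AdminContact@sample-company.com",
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.sec.gov",
}
```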

@mahantymanoj

> Follow https://www.sec.gov/os/webmaster-faq#developers on how to formulate the User-Agent header. Even a fake email would work; I think they are only doing some regex matching.

Thanks, buddy. A few weeks earlier I came across an article: the SEC has restricted requests, and one has to define the hostname and a user agent with an email id in the request. I have used my email address to access the filing link and it is now working.

eabase (Contributor) commented Feb 6, 2022

@mahantymanoj
Can you post your code snippet within a triple-backtick ( ``` ) block for proper markup, and explain how to implement it?



eabase (Contributor) commented Feb 6, 2022

@joeyism
How can we override the built-in requests that Edgar makes, using our own request headers?

eabase added a commit to eabase/py-edgar that referenced this issue Feb 7, 2022
* Added new required request headers for SEC EDGAR
* Added UTF-8 header
* Changed dependency from broken fuzzywuzzy to new rapidfuzz. Fixes joeyism#28
* Fixes joeyism#24 (with user-comment poor-man's patch for checking rate limits)
* Minor formatting adjustments for code readability

Changes to be committed:
	modified:   edgar/company.py
	modified:   edgar/document.py
	modified:   edgar/edgar.py
	modified:   requirements.txt
mahantymanoj commented Feb 7, 2022

> @mahantymanoj Can you post your code snippet within a triple-backtick ( ``` ) block for proper markup, and explain how to implement it?

```python
from time import sleep
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError


def urlRequest(final_url, user_agent):
    """Request a URL with urllib and return the open response."""
    conn = Request(final_url, headers=user_agent)
    response = urlopen(conn, timeout=20)
    return response


def urlRequestHit(link, ua):
    """Retry the request until a good read, pausing between attempts."""
    good_read = False
    while not good_read:
        sleep(5)
        try:
            user_agent = {'User-agent': ua, 'Host': 'www.sec.gov'}
            table_response = urlRequest(link, user_agent)
            try:
                table_response = table_response.read()
                table_response = table_response.decode('utf-8')
                good_read = True
            finally:
                pass
        except HTTPError as e:
            print("HTTP Error:", e.code, end=" ")
        except URLError as e:
            print("URL Error:", e.reason, end=" ")
        except TimeoutError as e:
            print("Timeout Error:", e, end=" ")
    return table_response


# Function call:
# ua = "<your email address, e.g. example@gmail.com>"
xml_response = urlRequestHit(xml_htm, ua)
```

I am working with XBRL data files: scraping all the XML links and transforming them into DataFrames.
eabase (Contributor) commented Feb 7, 2022

@mahantymanoj
You need to use the back-ticks (`), not single quotes (') to get correct markup.

@mahantymanoj

> @mahantymanoj You need to use the back-ticks (`), not single quotes (') to get correct markup.

edit done...

joeyism pushed a commit that referenced this issue Nov 15, 2022
* Added new required request headers for SEC EDGAR
* Added UTF-8 header
* Changed dependency from broken fuzzywuzzy to new rapidfuzz. Fixes #28
* Fixes #24 (with user-comment poor-man's patch for checking rate limits)
* Minor formatting adjustments for code readability

Changes to be committed:
	modified:   edgar/company.py
	modified:   edgar/document.py
	modified:   edgar/edgar.py
	modified:   requirements.txt
@gsterndale

The folks at the SEC published some "guidance" on their FAQ that includes sample request headers. You can find it at https://www.sec.gov/os/webmaster-faq#developers.
