# Rodz Andrie Amor

Note: This notebook was also meant to run in Google Colab.

API used for quickly gathering SEC data

https://sec-api.io/sandbox

We are primarily concerned with Item 1A (Risk Factors).


Complete list of all 10-K items
- 1 - Business
- 1A - Risk Factors
- 1B - Unresolved Staff Comments
- 2 - Properties
- 3 - Legal Proceedings
- 4 - Mine Safety Disclosures
- 5 - Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
- 6 - Selected Financial Data (prior to February 2021)
- 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations
- 7A - Quantitative and Qualitative Disclosures about Market Risk
- 8 - Financial Statements and Supplementary Data
- 9 - Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
- 9A - Controls and Procedures
- 9B - Other Information
- 10 - Directors, Executive Officers and Corporate Governance
- 11 - Executive Compensation
- 12 - Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
- 13 - Certain Relationships and Related Transactions, and Director Independence
- 14 - Principal Accountant Fees and Services


Companies to test on:
- BP - BP
- Chevron - CVX
- Shell - SHEL
- Exxon Mobil - XOM
- Pioneer Natural Resources Company - PXD

In [None]:
!pip install sec-api

from sec_api import ExtractorApi
from sec_api import QueryApi

import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

from google.colab import files
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Query the API for the URLs of the relevant filings on SEC EDGAR. The next section needs the link to each 10-K filing in order to extract its risk section.

In [None]:
# # Used for testing so that API calls are not wasted
# example_response = {
#   "total": {
#     "value": 31,
#     "relation": "eq"
#   },
#   "filings": [
#     {
#       "id": "0de409aa5047085970060a9efa218f8b",
#       "accessionNo": "0000320193-23-000106",
#       "cik": "320193",
#       "ticker": "AAPL",
#       "companyName": "Apple Inc.",
#       "companyNameLong": "Apple Inc. (Filer)",
#       "formType": "10-K",
#       "description": "Form 10-K - Annual report [Section 13 and 15(d), not S-K Item 405]",
#       "filedAt": "2023-11-02T18:08:27-04:00",
#       "linkToTxt": "https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/0000320193-23-000106.txt",
#       "linkToHtml": "https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/0000320193-23-000106-index.htm",
#       "linkToXbrl": "",
#       "linkToFilingDetails": "https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm",
#       }]
# }
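
As a minimal offline sketch of how the filing links can be read out of a response shaped like the sample above: `filing_urls` is a hypothetical helper (it is not part of the original notebook), and the response below is a trimmed copy of the sample, so no API call is spent.

```python
# Hypothetical helper (an assumption, not part of the original notebook):
# list the filing URLs contained in a Query API style response.
def filing_urls(response):
    """Return (ticker, filed date, URL) tuples for each filing in the response."""
    rows = []
    for filing in response.get("filings", []):
        rows.append((
            filing["ticker"],
            filing["filedAt"][:10],          # keep only the YYYY-MM-DD part
            filing["linkToFilingDetails"],   # link later passed to the extractor
        ))
    return rows

# Trimmed copy of the sample response, kept local so the sketch runs offline.
example_response = {
    "total": {"value": 31, "relation": "eq"},
    "filings": [
        {
            "ticker": "AAPL",
            "companyName": "Apple Inc.",
            "formType": "10-K",
            "filedAt": "2023-11-02T18:08:27-04:00",
            "linkToFilingDetails": "https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm",
        }
    ],
}

for ticker, filed, url in filing_urls(example_response):
    print(ticker, filed, url)
```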


## Disclaimer: Avoid running this cell unnecessarily. I have a limited number of API requests (100 requests in total). If I run out, please notify me and I will update the API key. Thanks!
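
One way to avoid spending requests on repeated runs is to cache each response on disk. The sketch below is only an illustration under stated assumptions: `cached_call`, `fake_fetch`, and the cache file name are hypothetical, and `fake_fetch` stands in for a real request such as `queryApi.get_filings(query_params)`.

```python
import json
import os
import tempfile

def cached_call(cache_path, fetch):
    """Return the JSON stored at cache_path, calling fetch() only on a cache miss."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)
    result = fetch()
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result

# Stand-in for a real API request; it counts how often it is invoked.
calls = {"count": 0}

def fake_fetch():
    calls["count"] += 1
    return {"filings": []}

cache_file = os.path.join(tempfile.mkdtemp(), "query_response.json")
first = cached_call(cache_file, fake_fetch)   # cache miss: fetch() runs
second = cached_call(cache_file, fake_fetch)  # cache hit: read from disk
print(calls["count"])
```

In Colab, pointing the cache path at the mounted Drive folder would let the cache survive a runtime restart.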

In [None]:
# Removes the formatting text in the response
def clean_sec_text(text):
  cleaned_text = ""
  for line in text.splitlines():
    if line[:2] == "##" or line[:2] == "&#" or line == "":
      continue
    else:
      cleaned_text += line + "\n"

  return cleaned_text

def extract_10k_risk_section(filing_url, extract_mode):
  # get the standardized HTML of section 1A "Risk Factors"
  if extract_mode == "html":
    section_html = extractorApi.get_section(filing_url, "1A", "html")

    section_html = BeautifulSoup(section_html, "html.parser")
    return section_html.prettify()

  else: # get the standardized text of section 1A "Risk Factors"
    section_text = extractorApi.get_section(filing_url, "1A", "text")

    return clean_sec_text(section_text)

def extract_company_risk_documents(company_tickers):
  print(f"Starting download for {company_tickers}")

  data = []

  query_params = {
    "query": {
        "query_string": {
            "query": f"formType:\"10-K\" AND ticker:({company_tickers})"
        }
    },
    "from": "0",
    "size": "50", # Number of filings returned by this request (not a 5-year window); adjust as needed
    "sort": [{ "filedAt": { "order": "desc" } }]
  }

  response = queryApi.get_filings(query_params)
  # response = example_response
  print(response)

  print(f"{company_tickers} SEC 10-K filing URLs requested.")

  company_data = {}

  for filing in response["filings"]:
      ticker = filing["ticker"]
      if ticker not in company_data:
          company_data[ticker] = []

      filing_url = filing["linkToFilingDetails"]
      risk_section_text = extract_10k_risk_section(filing_url, "text")

      company_data[ticker].append({
          "Ticker": ticker,
          "Company Name": filing["companyName"],
          "URL": filing_url,
          "Fill Date": filing["filedAt"],
          "Risk Factors Text": risk_section_text  # Adding the extracted text
      })

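  # Combine every ticker's records into one table for a quick visual check
  for records in company_data.values():
      data.extend(records)
  df_urls = pd.DataFrame(data)
  display(df_urls)

  # Save one CSV file per ticker
  for ticker, records in company_data.items():
      df_urls = pd.DataFrame(records)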
      filename = f"{ticker}_sec10k_risk_corpus.csv"
      df_urls.to_csv(filename, index=False)
      print(f"{ticker} SEC 10-K filing downloaded and saved into {filename}.")

  print("All URLs downloaded and saved into a CSV.")
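
The line filter in `clean_sec_text` can be checked offline before any request is made. The sketch below re-states the same rule (drop lines starting with `##` or `&#`, and empty lines) in a compact form under the hypothetical name `clean_sec_text_sketch`; the sample string is made up for illustration and is not real extractor output.

```python
def clean_sec_text_sketch(text):
    """Drop empty lines and lines starting with '##' or '&#'; keep the rest."""
    return "".join(
        line + "\n"
        for line in text.splitlines()
        if line != "" and not line.startswith(("##", "&#"))
    )

# Made-up sample imitating formatting markers mixed into section text.
sample = (
    "##TABLE_START\n"
    "Item 1A. Risk Factors\n"
    "\n"
    "&#160;\n"
    "Prices may fluctuate.\n"
    "##TABLE_END"
)

print(clean_sec_text_sketch(sample))
```

Note that a line containing only spaces is kept by this rule, because only the exact empty string is filtered.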

Example SEC filing links:

[XOM](https://www.sec.gov/edgar/browse/?CIK=34088&owner=exclude)
- [Annual Report](https://www.sec.gov/ix?doc=/Archives/edgar/data/34088/000003408824000018/xom-20231231.htm)
- [Quarterly Report (September 2023)](https://www.sec.gov/ix?doc=/Archives/edgar/data/34088/000003408823000056/xom-20230930.htm)

In [None]:
# # Example of text extraction
# extract_mode = "text"

# XOM_filing_url = "https://www.sec.gov/Archives/edgar/data/34088/000003408824000018/xom-20231231.htm"

# XOM_text = extract_10k_risk_section(XOM_filing_url, extract_mode)
# print(XOM_text)

In [None]:
# My (Andrie) API Key
# Warning: BE CAREFUL, free tier is only 100 api calls :(
api_key = "043d8e779a41b982a6e68a9b2ab148b67505a74716a467618d56829b439f3b98"
extractorApi = ExtractorApi(api_key)
queryApi = QueryApi(api_key=api_key)
company_tickers = "XOM, BP, SHEL, CVX, PXD"

extract_company_risk_documents(company_tickers)

Starting download for for XOM, BP, SHEL, CVX, PXD
{'total': {'value': 94, 'relation': 'eq'}, 'query': {'from': 0, 'size': 50}, 'filings': [{'id': 'e90fbd9f5abf1d7e2286ba9aa22dafb9', 'accessionNo': '0000034088-24-000018', 'cik': '34088', 'ticker': 'XOM', 'companyName': 'EXXON MOBIL CORP', 'companyNameLong': 'EXXON MOBIL CORP (Filer)', 'formType': '10-K', 'description': 'Form 10-K - Annual report [Section 13 and 15(d), not S-K Item 405]', 'filedAt': '2024-02-28T16:37:18-05:00', 'linkToTxt': 'https://www.sec.gov/Archives/edgar/data/34088/000003408824000018/0000034088-24-000018.txt', 'linkToHtml': 'https://www.sec.gov/Archives/edgar/data/34088/000003408824000018/0000034088-24-000018-index.htm', 'linkToXbrl': '', 'linkToFilingDetails': 'https://www.sec.gov/Archives/edgar/data/34088/000003408824000018/xom-20231231.htm', 'entities': [{'companyName': 'EXXON MOBIL CORP (Filer)', 'cik': '34088', 'irsNo': '135409005', 'stateOfIncorporation': 'NJ', 'fiscalYearEnd': '1231', 'type': '10-K', 'act': '

XOM SEC 10-K filing downloaded and saved into XOM_sec10k_risk_corpus.csv.
CVX SEC 10-K filing downloaded and saved into CVX_sec10k_risk_corpus.csv.
PXD SEC 10-K filing downloaded and saved into PXD_sec10k_risk_corpus.csv.
All URLs downloaded and saved into a CSV.
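
The log above reports CSV files for XOM, CVX and PXD only, so the query apparently returned no 10-K filings for BP or SHEL. One plausible reason, not confirmed by this notebook, is that both are foreign private issuers that file their annual reports on Form 20-F rather than Form 10-K. A small check like the following could make such gaps visible; `missing_tickers` and `sample_response` are hypothetical names used only for this sketch.

```python
def missing_tickers(requested, response):
    """Return the requested tickers that have no filing in the response."""
    wanted = [t.strip() for t in requested.split(",") if t.strip()]
    found = {filing["ticker"] for filing in response.get("filings", [])}
    return [t for t in wanted if t not in found]

# Minimal stand-in for a query response; only the ticker field matters here.
sample_response = {
    "filings": [
        {"ticker": "XOM"},
        {"ticker": "CVX"},
        {"ticker": "PXD"},
        {"ticker": "XOM"},
    ]
}

print(missing_tickers("XOM, BP, SHEL, CVX, PXD", sample_response))
```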
