# SEC EDGAR API Example Notebook

This notebook provides examples for interacting with the SEC EDGAR API. This is using Apple as an example.

## Submissions Endpoint

The submissions endpoint returns a company’s filing history based on their 10-digit CIK.

In [21]:
import requests
import json

SEC_EDGAR = "data/SEC_EDGAR"

# Set the User-Agent header to identify the requesting party.
headers = {
    "User-Agent": "Zain Ali (zali@sandiego.edu)",
    "Accept-Encoding": "gzip, deflate",
    "Host": "data.sec.gov"
}

# Example: Retrieve filing history for Apple Inc.
cik = "0000320193"  # Apple's 10-digit CIK (padded with zeros if necessary)
url = f"https://data.sec.gov/submissions/CIK{cik}.json"

response = requests.get(url, headers=headers)

if response.status_code == 200:
    filings_data = response.json()
    # Optionally, save the data to a file
    with open(SEC_EDGAR+"/apple_filings.json", "w") as f:
        json.dump(filings_data, f, indent=4)
    print("Submissions data retrieved and saved to apple_filings.json")
else:
    print(f"Error retrieving submissions data: {response.status_code}")

Submissions data retrieved and saved to apple_filings.json


## XBRL CompanyConcept Endpoint

This endpoint returns disclosures for a specified company and concept according to the given taxonomy.

In [22]:
headers = {
    "User-Agent": "Zain Ali (zali@sandiego.edu)",
    "Accept-Encoding": "gzip, deflate",
    "Host": "data.sec.gov"
}

# Example: Retrieve the XBRL disclosure for "AccountsPayableCurrent" under "us-gaap" for Apple Inc.
cik = "0000320193"
taxonomy = "us-gaap"
concept = "AccountsPayableCurrent"

url = f"https://data.sec.gov/api/xbrl/companyconcept/CIK{cik}/{taxonomy}/{concept}.json"

response = requests.get(url, headers=headers)

if response.status_code == 200:
    concept_data = response.json()
    with open(SEC_EDGAR+"/apple_accounts_payable_current.json", "w") as f:
        json.dump(concept_data, f, indent=4)
    print("Company concept data retrieved and saved to apple_accounts_payable_current.json")
else:
    print(f"Error retrieving company concept data: {response.status_code}")

Company concept data retrieved and saved to apple_accounts_payable_current.json


## XBRL CompanyFacts Endpoint

This endpoint aggregates all company XBRL concept data into a single JSON object.

In [23]:
headers = {
    "User-Agent": "Zain Ali (zali@sandiego.edu)",
    "Accept-Encoding": "gzip, deflate",
    "Host": "data.sec.gov"
}

# Example: Retrieve complete XBRL facts data for Apple Inc.
cik = "0000320193"
url = f"https://data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json"

response = requests.get(url, headers=headers)

if response.status_code == 200:
    company_facts = response.json()
    with open(SEC_EDGAR+"/apple_company_facts.json", "w") as f:
        json.dump(company_facts, f, indent=4)
    print("Company facts data retrieved and saved to apple_company_facts.json")
else:
    print(f"Error retrieving company facts data: {response.status_code}")

Company facts data retrieved and saved to apple_company_facts.json


## XBRL Frames Endpoint

This endpoint returns frame data by aggregating one fact per filing for a specific concept and period.

In [24]:
headers = {
    "User-Agent": "Zain Ali (zali@sandiego.edu)",
    "Accept-Encoding": "gzip, deflate",
    "Host": "data.sec.gov"
}

# Example: Retrieve frame data for "AccountsPayableCurrent" (us-gaap) in USD for period CY2019Q1I.
taxonomy = "us-gaap"
concept = "AccountsPayableCurrent"
unit = "USD"
period = "CY2019Q1I"  # Instantaneous data for Q1 of 2019

url = f"https://data.sec.gov/api/xbrl/frames/{taxonomy}/{concept}/{unit}/{period}.json"

response = requests.get(url, headers=headers)

if response.status_code == 200:
    frame_data = response.json()
    with open(SEC_EDGAR+"/accounts_payable_frame.json", "w") as f:
        json.dump(frame_data, f, indent=4)
    print("Frame data retrieved and saved to accounts_payable_frame.json")
else:
    print(f"Error retrieving frame data: {response.status_code}")

Frame data retrieved and saved to accounts_payable_frame.json


# federalregister.gov Docs

In [25]:
import os
import requests
import json
import datetime


federalregister = "data/federalregister"

def save_response_to_file(test_name, data):
    """
    Saves the JSON data to a file in a directory named after the test.
    Each file is timestamped.
    """
    # Define the base directory
    base_dir = federalregister
    # Create a test-specific directory
    test_dir = os.path.join(base_dir, test_name)
    os.makedirs(test_dir, exist_ok=True)
    
    # Create a filename using the current timestamp
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = os.path.join(test_dir, f"response_{timestamp}.json")
    
    # Save the data as a JSON file
    with open(filename, "w") as f:
        json.dump(data, f, indent=2)
    
    print(f"Response saved to {filename}")

def query_federal_register(search_term, per_page=5):
    """
    Queries the Federal Register API using a search term.
    Adjusted to avoid invalid parameters.
    """
    base_url = "https://www.federalregister.gov/api/v1/documents.json"
    
    # Use 'q' for a general text search instead of using conditions[topics][] 
    params = {
        "q": search_term,
        "per_page": per_page,
        "order": "newest"
    }
    
    try:
        response = requests.get(base_url, params=params)
        response.raise_for_status()  # Raises an error for status codes 4xx/5xx
        data = response.json()
        print("API call successful.")
        return data
        
    except requests.HTTPError as http_err:
        # Print HTTP error details and return an error dictionary
        print(f"HTTP error occurred: {http_err}")
        try:
            error_data = response.json()
        except Exception:
            error_data = {"error": "Failed to parse error details"}
        return error_data
    except Exception as err:
        # Catch any other exceptions
        print(f"An error occurred: {err}")
        return {"error": str(err)}

if __name__ == "__main__":
    # Example usage:
    tests = [
        {"name": "environment_search", "search_term": "environment"},
        {"name": "health_search", "search_term": "healthcare"}
    ]
    
    for test in tests:
        print(f"Performing test: {test['name']} with search term: {test['search_term']}")
        result = query_federal_register(test['search_term'], per_page=5)
        # Save the result for this test in its own directory
        save_response_to_file(test['name'], result)


Performing test: environment_search with search term: environment
API call successful.
Response saved to data/federalregister/environment_search/response_20250121_001825.json
Performing test: health_search with search term: healthcare
API call successful.
Response saved to data/federalregister/health_search/response_20250121_001825.json


# U.S. Legal Data Retrieval Using the govinfo API

**Overview:**  
This notebook demonstrates how to programmatically retrieve legal documents and related metadata from the govinfo API.  

**API Key Setup:**  
To use the govinfo API you must register for an API key at [https://www.govinfo.gov/api-signup](https://www.govinfo.gov/api-signup). After registration, replace `<YOUR_API_KEY_HERE>` in the code with your actual API key.

**Project Goals:**  
- Automate the retrieval of U.S. legal data for compliance risk analysis.  
- Collect and store data to support automated identification of missing clauses, ambiguous language, or other compliance risks.

**Non-Goals:**  
- The notebook does not generate legal documents or replace legal professionals’ judgment.

**Data Storage:**  
Downloaded files are saved in a structured directory under `data/` for further preprocessing and ML analysis.

## Utility Functions

Create directories if they do not exist and save JSON data with timestamped filenames.

In [29]:
govinfo = "data/govinfo"

def ensure_dir(directory):
    """Ensure that a directory exists."""
    if not os.path.exists(directory):
        os.makedirs(directory)

def save_data(output_dir, data, filename_prefix):
    """Save JSON data into a file with a timestamp in the filename."""
    ensure_dir(output_dir)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filepath = os.path.join(output_dir, f"{filename_prefix}_{timestamp}.json")
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4)
    print(f"Data saved to {filepath}")

## API Key Setup

Replace `<YOUR_API_KEY_HERE>` with your actual API key.

**Getting an API Key:**  
Visit [https://www.govinfo.gov/api-signup](https://www.govinfo.gov/api-signup) and follow the instructions to sign up. Once you receive your API key, update the code below.


In [11]:
API_KEY = ""


## Sample 1: Retrieve Collections

The govinfo API allows you to retrieve a list of collections. For example, the **Collections Service** endpoint returns all available collections along with metadata such as `collectionCode`, `collectionName`, `packageCount`, and `granuleCount`.

**Example request:**  
```
https://api.govinfo.gov/collections?api_key=API_KEY
```

In [31]:
import os
import requests
import json
import time
import datetime  # Import the full datetime module

def ensure_dir(directory):
    """Ensure that a directory exists."""
    if not os.path.exists(directory):
        os.makedirs(directory)

def save_data(output_dir, data, filename_prefix):
    """Save JSON data into a file with a timestamp in the filename."""
    ensure_dir(output_dir)
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    filepath = os.path.join(output_dir, f"{filename_prefix}_{timestamp}.json")
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4)
    print(f"Data saved to {filepath}")

def retrieve_collections(api_key):
    """
    Retrieve the list of collections available from the govinfo API.
    """
    base_url = "https://api.govinfo.gov/collections"
    params = {
        "api_key": api_key
    }
    try:
        print("Fetching collections list...")
        response = requests.get(base_url, params=params)
        response.raise_for_status()
        data = response.json()
        save_data(os.path.join(govinfo,"collections"), data, "collections")
        return data
    except requests.RequestException as e:
        print(f"Error retrieving collections data: {e}")
        return None

# Retrieve collections
collections_data = retrieve_collections(API_KEY)


Fetching collections list...
Data saved to data/govinfo/collections/collections_20250121_002155.json



## Sample 2: Retrieve Collection Updates (e.g., Congressional Bills)

Use the Collections Service endpoint to fetch package IDs that have been added or modified. In this example, we fetch data for the **BILLS** collection with a specified start date.

The URL format is:  
```
https://api.govinfo.gov/collections/BILLS/<lastModifiedStartDate>/[<lastModifiedEndDate>?]offsetMark=*&pageSize=<pageSize>&api_key=API_KEY
```

We will use the `offsetMark` for pagination. The initial value is `*` and each response provides the `nextPage` offsetMark.


In [33]:
import os
import requests
import json
import time
import datetime  # Import the full datetime module

def ensure_dir(directory):
    """Ensure that a directory exists."""
    if not os.path.exists(directory):
        os.makedirs(directory)

def save_data(output_dir, data, filename_prefix):
    """Save JSON data into a file with a timestamp in the filename."""
    ensure_dir(output_dir)
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    filepath = os.path.join(output_dir, f"{filename_prefix}_{timestamp}.json")
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4)
    print(f"Data saved to {filepath}")

def retrieve_collection_updates(collection_code, start_date, end_date=None, page_size=100):
    """
    Retrieve package IDs for a specific collection (e.g., BILLS) updated after start_date.
    You can optionally provide an end_date.
    
    Parameters:
    - collection_code: e.g., "BILLS"
    - start_date: in ISO8601 format, e.g., "2023-01-01T00:00:00Z"
    - end_date: in ISO8601 format or None
    - page_size: number of records per request (max 1000)
    """
    base_url = f"https://api.govinfo.gov/collections/{collection_code}"
    
    # Build the URL based on whether end_date is provided
    if end_date:
        base_url = f"{base_url}/{start_date}/{end_date}"
    else:
        base_url = f"{base_url}/{start_date}"
    
    offset_mark = "*"
    all_packages = []
    iteration = 0
    
    while offset_mark:
        params = {
            "offsetMark": offset_mark,
            "pageSize": page_size,
            "api_key": API_KEY
        }
        try:
            iteration += 1
            print(f"Fetching page {iteration} for collection '{collection_code}'...")
            response = requests.get(base_url, params=params)
            response.raise_for_status()
            data = response.json()
            
            # Append package data to the list (assumes data might be under a 'packages' key)
            if "packages" in data:
                all_packages.extend(data["packages"])
            else:
                all_packages.append(data)
                
            # Get the next offsetMark from the response if available
            next_offset = data.get("nextPage")
            if next_offset and next_offset != offset_mark:
                offset_mark = next_offset
            else:
                offset_mark = None

            # Respect rate limits: slight sleep between requests.
            time.sleep(1)
        except requests.RequestException as e:
            print(f"Error on page {iteration}: {e}")
            break

    # Save collected updates to data/collection_updates/{collection_code}
    output_dir = os.path.join(govinfo, "collection_updates", collection_code)
    save_data(output_dir, all_packages, f"{collection_code}_updates")
    return all_packages


# Example: Retrieve updates for Congressional Bills (BILLS) starting from January 1, 2023
bills_updates = retrieve_collection_updates("BILLS", "2023-01-01T00:00:00Z")


Fetching page 1 for collection 'BILLS'...
Fetching page 2 for collection 'BILLS'...
Error on page 2: 500 Server Error: Internal Server Error for url: https://api.govinfo.gov/collections/BILLS/2023-01-01T00:00:00Z?offsetMark=https%3A%2F%2Fapi.govinfo.gov%2Fcollections%2FBILLS%2F2023-01-01T00%3A00%3A00Z%3FoffsetMark%3DAoJw2qnltJQDMkJJTExTLTExOGhyMTAzNjNpaA%253D%253D%26pageSize%3D100&pageSize=100&api_key=C5a4DywZ0kYJYQheZQLalaRhgaPTz4fyaiap1WOG
Data saved to data/govinfo/collection_updates/BILLS/BILLS_updates_20250121_002345.json



## Sample 3: Retrieve Package Summaries

Once you have a package ID (for example, from the previous collection update), you can retrieve detailed metadata for that package using the **Packages Service**.

**Example URL:**  
```
https://api.govinfo.gov/packages/BILLS-115hr1625enr/summary?api_key=API_KEY
```

The function below retrieves a package summary for a given package ID.

In [32]:
def retrieve_package_summary(package_id):
    """
    Retrieve package summary for the given package_id.
    """
    base_url = f"https://api.govinfo.gov/packages/{package_id}/summary"
    params = {
        "api_key": API_KEY
    }
    try:
        print(f"Fetching summary for package: {package_id}")
        response = requests.get(base_url, params=params)
        response.raise_for_status()
        data = response.json()
        
        # Save package summary into its own folder
        output_dir = os.path.join(govinfo, f"package_summaries/{package_id}")
        save_data(output_dir, data, "package_summary")
        return data
    except requests.RequestException as e:
        print(f"Error retrieving summary for {package_id}: {e}")
        return None

# Retrieve a package summary example.
# Replace 'BILLS-115hr1625enr' with an actual package id obtained from the collection updates if available.
package_summary = retrieve_package_summary("BILLS-115hr1625enr")


Fetching summary for package: BILLS-115hr1625enr
Data saved to data/govinfo/package_summaries/BILLS-115hr1625enr/package_summary_20250121_002224.json
