# U.S. Legal Data Retrieval for Compliance Risk Analysis

This notebook retrieves legal data from two U.S. government APIs: govinfo.gov and the Federal Register.
Each source is saved to its own directory under `data/` so that it can be processed later in our NLP pipeline.
Our objective is to collect a broad, high-quality corpus to serve as a baseline for automated identification of compliance risks.

In [None]:
import os
import requests
import json
from datetime import datetime
import time

## Utility Functions
These helper functions will ensure directories exist and handle data saving with timestamped filenames.

In [None]:
def ensure_dir(directory):
    """Ensure that a directory exists."""
    # exist_ok=True makes this a no-op when the directory is already there
    os.makedirs(directory, exist_ok=True)

def save_data(output_dir, data, filename_prefix):
    """Save JSON data to a file with a timestamp in the filename."""
    ensure_dir(output_dir)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filepath = os.path.join(output_dir, f"{filename_prefix}_{timestamp}.json")
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4)
    print(f"Data saved to {filepath}")
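
Because every run writes a new timestamped file, later steps need a way to pick up the most recent one. The following is a minimal sketch, assuming the `<prefix>_<YYYYmmdd_HHMMSS>.json` naming produced by `save_data`; `load_latest` is a hypothetical helper and not part of the original notebook.

```python
import glob
import json
import os


def load_latest(output_dir, filename_prefix):
    """Return the parsed contents of the newest '<prefix>_<timestamp>.json' file, or None."""
    pattern = os.path.join(output_dir, f"{filename_prefix}_*.json")
    # Timestamps are zero-padded (YYYYmmdd_HHMMSS), so lexical order is chronological.
    candidates = sorted(glob.glob(pattern))
    if not candidates:
        return None
    with open(candidates[-1], "r", encoding="utf-8") as f:
        return json.load(f)
```

For example, `load_latest("data/govinfo", "govinfo_data")` would return the list written by the latest govinfo run, or `None` if nothing has been saved yet.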

## API Data Retrieval Functions

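Here we define functions to retrieve data from two U.S. sources: govinfo.gov (for U.S. Code, for example) and the Federal Register.
The govinfo function pages through results with offset/limit parameters, which is an assumption to verify against the API documentation; the Federal Register function pages by page number.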

In [None]:
def retrieve_govinfo_data(api_key, limit=50, max_pages=5):
    """
    Retrieve data from the govinfo.gov API.
    
    Adjust the endpoint and parameters as required by the API documentation.
    This example uses pagination by iterating over pages using an offset.
    """
    base_url = "https://api.govinfo.gov/collections/USCODE"  # Example endpoint: adjust based on real documentation
    all_data = []
    for page in range(max_pages):
        params = {
            "api_key": api_key,
            "offset": page * limit,
            "limit": limit,
        }
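        try:
            print(f"Fetching govinfo data: Page {page+1}")
            # A timeout keeps a stalled connection from hanging the notebook
            response = requests.get(base_url, params=params, timeout=30)
            response.raise_for_status()
            data = response.json()
        except (requests.RequestException, ValueError) as e:
            # ValueError covers a response body that is not valid JSON
            print(f"Error retrieving govinfo data on page {page+1}: {e}")
            break

        # Assuming the records are in a list under 'results'; adjust the key
        # to match the actual response schema
        if "results" in data:
            if not data["results"]:
                break  # empty page: nothing left to fetch
            all_data.extend(data["results"])
        else:
            all_data.append(data)

        # Pause between requests to respect API rate limiting
        time.sleep(1)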

    # Save the aggregated data
    save_data("data/govinfo", all_data, "govinfo_data")
    return all_data
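
Offset-based paging is only one option. Some APIs instead return a link to the next page in each response. The sketch below shows that pattern under stated assumptions: the key names `packages` and `nextPage` are placeholders to check against the govinfo documentation, and `collect_pages` and `fetch_json` are hypothetical names. The fetch function is passed in so the logic can be tried without network access.

```python
def collect_pages(fetch_json, first_url, items_key="packages",
                  next_key="nextPage", max_pages=5):
    """Follow next-page links until they run out or max_pages is reached."""
    items = []
    url = first_url
    for _ in range(max_pages):
        if not url:
            break  # no further page advertised by the previous response
        payload = fetch_json(url)
        items.extend(payload.get(items_key, []))
        url = payload.get(next_key)
    return items


# Offline demonstration with a fake two-page API (assumed response shape).
fake_api = {
    "page1": {"packages": [{"id": "a"}, {"id": "b"}], "nextPage": "page2"},
    "page2": {"packages": [{"id": "c"}], "nextPage": None},
}
records = collect_pages(fake_api.__getitem__, "page1")
print(len(records))
```

In the notebook, `fetch_json` could be a small wrapper around `requests.get(...).json()` that also attaches the API key.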

In [None]:
def retrieve_federal_register_data(limit=50, max_pages=5):
    """
    Retrieve data from the Federal Register API.
    
    This example paginates by page number using the `page` and `per_page` parameters.
    Endpoint and parameters should be adapted according to API documentation.
    """
    base_url = "https://www.federalregister.gov/api/v1/documents.json"
    all_data = []
    for page in range(max_pages):
        params = {
            "per_page": limit,
            "page": page + 1,  # Many APIs use page numbers starting at 1
        }
        try:
            print(f"Fetching Federal Register data: Page {page+1}")
            # A timeout keeps a stalled connection from hanging the notebook
            response = requests.get(base_url, params=params, timeout=30)
            response.raise_for_status()
            data = response.json()
        except (requests.RequestException, ValueError) as e:
            # ValueError covers a response body that is not valid JSON
            print(f"Error retrieving Federal Register data on page {page+1}: {e}")
            break

        # Federal Register API typically returns documents under a key like 'results'
        if "results" in data:
            if not data["results"]:
                break  # empty page: nothing left to fetch
            all_data.extend(data["results"])
        else:
            all_data.append(data)

        # Pause between requests to respect API rate limiting
        time.sleep(1)
            
    # Save the aggregated data
    save_data("data/federal_register", all_data, "federal_register_data")
    return all_data
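
Both retrieval functions give up at the first failed request, so a single transient error ends the whole run. One common alternative is to retry with exponential backoff. The sketch below is illustrative only: `call_with_backoff` is a hypothetical helper, and in this notebook `retry_on` would most likely be `requests.RequestException`.

```python
import time


def call_with_backoff(func, max_attempts=4, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
```

A request could then be wrapped as `call_with_backoff(lambda: requests.get(url, params=params, timeout=30))`. The `sleep` argument exists so the delay can be replaced in tests.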

## Retrieve and Store U.S. Legal Data

Now we call the above functions. You can adjust `max_pages` and `limit` parameters to fetch more data if needed.

Make sure you have a govinfo API key and adjust the endpoints as required. The Federal Register function, as written, sends no key.
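
To keep the key out of the notebook itself, it can be read from an environment variable instead of being pasted into a cell. This is a minimal sketch: the variable name `GOVINFO_API_KEY` and the helper `get_api_key` are assumptions chosen for illustration.

```python
import os


def get_api_key(env_var="GOVINFO_API_KEY", environ=None):
    """Read an API key from the environment and fail early if it is missing."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    # Treat the notebook's placeholder value as "not set"
    if not key or key == "YOUR_API_KEY_HERE":
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this cell."
        )
    return key
```

With this in place, the retrieval call could use `api_key=get_api_key()` rather than a literal string.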

In [None]:
# Replace with your actual govinfo API key.
govinfo_api_key = "YOUR_API_KEY_HERE"

# Retrieve U.S. Code data from govinfo.gov API
govinfo_results = retrieve_govinfo_data(api_key=govinfo_api_key, limit=50, max_pages=5)

In [None]:
# Retrieve data from the Federal Register API
federal_register_results = retrieve_federal_register_data(limit=50, max_pages=5)

## Next Steps

With the U.S. legal data downloaded and stored in the `data/` directory (split into `govinfo` and `federal_register`), you can now proceed with:

- **Data Preprocessing:** Clean and normalize the texts.
- **Annotation:** Label sections that contain missing clauses or ambiguous language, drawing on annotations from legal experts.
- **NLP Modeling:** Fine-tune NLP models (such as Legal-BERT) or apply a RAG framework using this data to support automated identification of compliance risks.
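
As a starting point for the preprocessing step, the sketch below cleans a single text field. It assumes the retrieved records may contain simple HTML markup and entities, which should be checked against the actual data; `normalize_text` is a hypothetical helper, and the regex-based tag stripping is a rough approximation rather than a full HTML parser.

```python
import html
import re
import unicodedata

_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def normalize_text(raw):
    """Strip simple HTML tags, decode entities, normalize Unicode, collapse whitespace."""
    text = _TAG_RE.sub(" ", raw)                # replace tags with a space
    text = html.unescape(text)                  # e.g. '&amp;' becomes '&'
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters
    return _WS_RE.sub(" ", text).strip()        # collapse runs of whitespace
```

Applied to each record's text field, this yields single-line strings that are easier to tokenize and compare.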

This notebook supports the project goals by assembling the baseline legal corpus for subsequent risk analysis. It does **not** replace expert legal judgment, and it does not generate legal documents from scratch.

**End of Notebook**