# Collect the comments on a single docket

This notebook collects the text and metadata of all the comments on a single docket from the Regulations.gov API. It also extracts the text from attachments submitted as PDFs and saves the comments and their metadata to a CSV file.

We may already have collected the comments on the docket you are looking for. You can check by searching for the docket on [www.commons-project.com/dockets](https://www.commons-project.com/dockets).

- Mention bulk download of comments from Regulations.gov API and the use and limitations on that!


#### Define `docket_id`

You can find the exact docket ID on [regulations.gov](https://www.regulations.gov/).

Example:

```python
docket_id = "EPA-HQ-OLEM-2023-0278"
```

In [20]:
docket_id = "EPA-HQ-OLEM-2023-0278"

## Setup

### Load in the necessary libraries

In [33]:
import pandas as pd
import html
import json
from flatten_json import flatten
import math
import os
import subprocess
import time
import datetime as dt
from datetime import date, datetime, timedelta
from glob import glob
from dotenv import load_dotenv
from io import BytesIO

import boto3
import botocore
import pdfplumber
import psycopg2
import PyPDF2
import pytz
import requests
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer

### API keys

Before you can run this notebook, you need an API key from regulations.gov. Request one at https://open.gsa.gov/api/regulationsgov/ by clicking the "Get API Key" button. Once you have the key, save it in the `.env` file under the name `REGGOV_API_KEY`, which is the variable the cell below reads.

```bash
echo "REGGOV_API_KEY=your-key-here" >> .env
```

Load in the regulations.gov API keys

In [22]:
load_dotenv()
api_key = os.getenv("REGGOV_API_KEY")
extra_api_key = os.getenv("REGGOV_API_KEY_N1")
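
Because `os.getenv` returns `None` when a variable is missing, a typo in the `.env` file only surfaces later, when the API rejects the request. A minimal sketch of an early check, assuming a hypothetical helper `require_env` and a placeholder variable `EXAMPLE_API_KEY` (neither is part of this notebook):

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Placeholder variable for demonstration only, not a real key
os.environ["EXAMPLE_API_KEY"] = "your-key-here"
print(require_env("EXAMPLE_API_KEY"))
```

In the notebook itself, the same helper could be called with `"REGGOV_API_KEY"` right after `load_dotenv()`.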

## Step 2: Collect the comments

### 2.1 Get all the comment ids

In [23]:
comment_ids = []
api_response = []
base_url = "https://api.regulations.gov/v4/comments"

# Request up to 250 comments per page, for pages 1-19
for page in range(1, 20):
    url = f"{base_url}?filter[docketId]={docket_id}&page[size]=250&page[number]={page}&sort=lastModifiedDate&api_key={api_key}"
    result = requests.get(url).json()
    # Stop at the first empty page (or at an error response without "data")
    if not result.get("data"):
        break
    api_response.append(result)
    comment_ids.extend(item["id"] for item in result["data"])

# totalElements is the number of comments that match the query
total = api_response[0]["meta"]["totalElements"] if api_response else 0
if total < 250:
    print("Seems like there are less than 250 comments in this docket.")

# If the first query did not return every comment, run a second query that starts
# at the lastModifiedDate of the last comment collected so far
if total > len(comment_ids):
    last_modified = api_response[-1]["data"][-1]["attributes"]["lastModifiedDate"]
    date_obj = datetime.strptime(last_modified, "%Y-%m-%dT%H:%M:%SZ")
    # The API returns UTC timestamps; shift back 5 hours (this assumes the filter
    # expects US Eastern Standard Time)
    greater_than = date_obj - timedelta(hours=5)

    for page in range(1, 20):
        url = f"{base_url}?filter[docketId]={docket_id}&filter[lastModifiedDate][ge]={greater_than}&page[size]=250&page[number]={page}&sort=lastModifiedDate&api_key={api_key}"
        result = requests.get(url).json()
        if not result.get("data"):
            break
        comment_ids.extend(item["id"] for item in result["data"])

    # The [ge] filter returns the boundary comment again, so drop duplicate IDs
    comment_ids = list(dict.fromkeys(comment_ids))

Seems like there are less than 250 comments in this docket.


In [24]:
print(f"Collected the IDs for {len(comment_ids)} comments")

Collected the IDs for 102 comments
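
The paging logic in this step can be separated from the HTTP call, which makes it easier to test without an API key. The sketch below is an illustration only: `collect_ids`, `fetch_page`, and the made-up IDs are hypothetical names and data, and the default of 19 pages simply mirrors the range requested in the cell above.

```python
def collect_ids(fetch_page, max_pages=19):
    """Collect comment IDs page by page until an empty page is returned.

    fetch_page is any callable that takes a page number and returns the
    parsed JSON response as a dictionary with a "data" list.
    """
    ids = []
    for page in range(1, max_pages + 1):
        result = fetch_page(page)
        items = result.get("data") or []
        if not items:
            break
        ids.extend(item["id"] for item in items)
    return ids


# Stand-in for the API: two pages of made-up IDs, then empty pages
fake_pages = {
    1: {"data": [{"id": "DOCKET-0001"}, {"id": "DOCKET-0002"}]},
    2: {"data": [{"id": "DOCKET-0003"}]},
}
example_ids = collect_ids(lambda page: fake_pages.get(page, {"data": []}))
print(example_ids)
```

To use it against the real API, `fetch_page` would wrap the `requests.get(url).json()` call with the page number filled into the URL.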


### 2.2 Get the metadata for each comment

In [25]:
keys = ["L1", "L2", "L3", "L4", "L5", "J1", "J2", "J3", "J4", "J5", "M1", "M2", "M3", "M4", "M5"]

In [26]:
# Each key fetches a batch of 50 comments before we rotate to the next key
batch_size = 50

# Figure out how many rounds through the keys it will take to scrape all the comments
rounds = math.ceil(len(comment_ids) / (len(keys) * batch_size))

comment_details = []

# We scrape the comments using their index number in the ids list
num = 0

for _ in range(rounds):
    for key in keys:
        # Use the numbered key if it is set in .env, otherwise fall back to REGGOV_API_KEY
        api_key = os.getenv(f"REGGOV_API_KEY_{key}") or os.getenv("REGGOV_API_KEY")
        for i in range(batch_size):
            # In the last round, there might not be 50 comments left, so we stop when we run out
            if num >= len(comment_ids):
                break
            comment_id = comment_ids[num]
            url = f"https://api.regulations.gov/v4/comments/{comment_id}?include=attachments&api_key={api_key}"
            try:
                response = requests.get(url)
                response.raise_for_status()
            except requests.RequestException as e:
                # Move on to the next key and retry this comment with it
                print(f"Request for {comment_id} failed: {type(e).__name__}")
                break

            # Append the data to comment_details so we can access the download links later
            comment_details.append(response.json())
            num = num + 1

            # Sleep for 0.4 seconds to avoid hitting the API rate limit
            time.sleep(0.4)

if num < len(comment_ids):
    print(f"{len(comment_ids) - num} comments were not collected; rerun this cell to retry")
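
The rotation above gives each key a batch of 50 consecutive comments and starts over with the first key once every key has been used. A minimal sketch of that assignment as a pure function, assuming a hypothetical helper `assign_keys` (not part of this notebook) and a small batch size so the pattern is visible:

```python
def assign_keys(comment_ids, keys, batch_size=50):
    """Pair every comment ID with the key name that should fetch it.

    Keys are used in order, each for batch_size consecutive comments, and the
    list of keys is reused from the start once every key has had a turn.
    """
    pairs = []
    for index, comment_id in enumerate(comment_ids):
        key = keys[(index // batch_size) % len(keys)]
        pairs.append((comment_id, key))
    return pairs


example = assign_keys(["c1", "c2", "c3", "c4", "c5"], ["L1", "L2"], batch_size=2)
print(example)
# [('c1', 'L1'), ('c2', 'L1'), ('c3', 'L2'), ('c4', 'L2'), ('c5', 'L1')]
```

Computing the assignment up front avoids tracking a separate index counter inside nested loops.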

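### 2.3 Extract the text from comments stored as PDFs

The function below downloads the files attached to each comment, converts `.docx` files to PDF with LibreOffice, and extracts the text from the first three pages of each PDF.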

In [27]:
def get_comment_text(comments):

    # Function to extract text
    def text_extraction(element):
        # Extracting the text from the in-line text element
        line_text = element.get_text()

        # Find the formats of the text
        # Initialize the list with all the formats that appeared in the line of text
        line_formats = []
        for text_line in element:
            if isinstance(text_line, LTTextContainer):
                # Iterating through each character in the line of text
                for character in text_line:
                    if isinstance(character, LTChar):
                        # Append the font name of the character
                        line_formats.append(character.fontname)
                        # Append the font size of the character
                        line_formats.append(character.size)
        # Find the unique font sizes and names in the line
        format_per_line = list(set(line_formats))

        # Return a tuple with the text in each line along with its format
        return (line_text, format_per_line)

    # Loop through the comments and extract the text from the attached pdfs
    for comment in comments:
        id = comment["data"]["id"]
        try:
            num = 1
            attachment_url = ""
            for files in comment["included"]:
                result = ""
                # Loop through the files and get the pdfs if they exist
                if files["attributes"]["fileFormats"] is not None:
                    for file in files["attributes"]["fileFormats"]:
                        try:
                            url = file["fileUrl"]
                            response = requests.get(url)
                            attachment_url = attachment_url + str(url) + " "

                            pdf_path = os.path.abspath(f"{id}_attachment_{num}.pdf")
                            doc_path = os.path.abspath("temp.docx")
                            # Default LibreOffice location on macOS; change this path on other systems
                            soffice_path = (
                                "/Applications/LibreOffice.app/Contents/MacOS/soffice"
                            )

                            if url.endswith(".docx"):
                                with open(doc_path, "wb") as f:
                                    f.write(response.content)
                                # Convert the Word document to temp.pdf in the working directory
                                subprocess.run(
                                    [
                                        soffice_path,
                                        "--convert-to",
                                        "pdf",
                                        "--headless",
                                        doc_path,
                                    ],
                                    check=True,
                                )
                                os.rename("temp.pdf", pdf_path)
                                os.remove("temp.docx")

                            else:
                                with open(pdf_path, "wb") as f:
                                    f.write(response.content)

                            # Extract the text from the first three pages of the PDF
                            for pagenum, page in enumerate(extract_pages(pdf_path)):
                                if pagenum > 2:
                                    break

                                # Sort the layout elements as they appear on the page, top to bottom
                                page_elements = sorted(
                                    page, key=lambda element: element.y1, reverse=True
                                )

                                # Collect the text of every text element on the page
                                page_text = []
                                for element in page_elements:
                                    if isinstance(element, LTTextContainer):
                                        line_text, _ = text_extraction(element)
                                        page_text.append(line_text)

                                # A page without any text elements adds an empty string
                                result = result + "\n \n" + "".join(page_text)

                            # Remove only the temporary PDF created for this attachment
                            try:
                                os.remove(pdf_path)
                            except OSError:
                                print("No files to remove")

                            num = num + 1

                            # Save the extracted text to the json file from the api call
                            comment["data"]["attributes"]["pdf_extracted_text"] = result
                            comment["data"]["attributes"][
                                "attachment_read"
                            ] = "attachment extracted"
                            comment["data"]["attributes"]["attachments_url"] = attachment_url

                        except Exception as inst:
                            print(type(inst))  # the exception type
                            x = inst.args  # unpack args
                            print("x =", x)
                            comment["data"]["attributes"][
                                "attachment_read"
                            ] = "attachment failed"
                            comment["data"]["attributes"]["attachments_url"] = attachment_url
                        except:
                            comment["data"]["attributes"][
                                "attachment_read"
                            ] = "attachment failed"
                            comment["data"]["attributes"]["attachments_url"] = attachment_url
                            raise
                else:
                    comment["data"]["attributes"]["attachment_read"] = "no attachment"
                    comment["data"]["attributes"]["attachments_url"] = None
        except KeyError:
            comment["data"]["attributes"]["attachment_read"] = "no attachment"
            comment["data"]["attributes"]["attachments_url"] = None
    return comments

Call the pdf extraction function

In [None]:
full_data = get_comment_text(comment_details)
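
After the extraction has run, it can be useful to see how many comments ended up with each `attachment_read` status before moving on. A small sketch using made-up records that mimic the structure of `full_data`; the helper `summarize_attachments` and the fallback label `"not processed"` are hypothetical and not part of this notebook:

```python
from collections import Counter


def summarize_attachments(comments):
    """Count how many comments ended up with each attachment_read status."""
    return Counter(
        comment["data"]["attributes"].get("attachment_read", "not processed")
        for comment in comments
    )


# Made-up records that mimic the structure returned by get_comment_text
example_comments = [
    {"data": {"attributes": {"attachment_read": "attachment extracted"}}},
    {"data": {"attributes": {"attachment_read": "no attachment"}}},
    {"data": {"attributes": {"attachment_read": "no attachment"}}},
    {"data": {"attributes": {"attachment_read": "attachment failed"}}},
    {"data": {"attributes": {}}},
]
status_counts = summarize_attachments(example_comments)
print(status_counts)
```

Calling `summarize_attachments(full_data)` would give the same overview for the real docket.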

## Step 3: Reshape and save the data

In [28]:
def structure_data(data, keys_to_include):
    final_data = []
    for comment in data:
        # Flatten the nested dictionaries
        flat_comment = flatten(comment)
        result_dict = {}
        for key, value in flat_comment.items():
            if key in keys_to_include:
                if isinstance(value, dict):
                    # Recursively process nested dictionaries
                    result_dict[key] = structure_data(value, keys_to_include)
                else:
                    # Include non-dictionary values
                    result_dict[key] = value if value is not None else ""
        final_data.append(result_dict)

    # Rename the keys to match the database
    key_mapping = {
        "data_id": "comment_id",
        "data_attributes_commentOnDocumentId": "document_id",
        "data_attributes_docketId": "docket_id",
        "data_attributes_agencyId": "agency_id",
        "data_attributes_title": "title",
        "data_attributes_comment": "comment",
        "data_attributes_pdf_extracted_text": "comment_pdf_extracted",
        "data_attributes_firstName": "commenter_first_name",
        "data_attributes_lastName": "commenter_last_name",
        "data_attributes_organization": "commenter_organization",
        "data_attributes_address1": "commenter_address1",
        "data_attributes_address2": "commenter_address2",
        "data_attributes_zip": "commenter_zip",
        "data_attributes_city": "commenter_city",
        "data_attributes_stateProvinceRegion": "commenter_state_province_region",
        "data_attributes_country": "commenter_country",
        "data_attributes_email": "commenter_email",
        "data_attributes_receiveDate": "receive_date",
        "data_attributes_postedDate": "posted_date",
        "data_attributes_postmarkDate": "postmark_date",
        "data_attributes_duplicateComments": "duplicate_comments",
        "data_attributes_attachment_read": "attachment_read",
        "data_attributes_attachments_url": "attachment_url",
        "data_attributes_withdrawn": "withdrawn",
        "data_links_self": "api_url",
    }

    # Rename the keys so they match the database
    for i in final_data:
        for old_key, new_key in key_mapping.items():
            i[new_key] = i.pop(old_key, "")
    return final_data

In [29]:
# Specify the keys to include in the resulting dictionary
keys_to_include = [
    "data_id",
    "data_attributes_commentOnDocumentId",
    "data_attributes_docketId",
    "data_attributes_agencyId",
    "data_attributes_title",
    "data_attributes_comment",
    "data_attributes_pdf_extracted_text",
    "data_attributes_firstName",
    "data_attributes_lastName",
    "data_attributes_organization",
    "data_attributes_address1",
    "data_attributes_address2",
    "data_attributes_zip",
    "data_attributes_city",
    "data_attributes_country",
    "data_attributes_stateProvinceRegion",
    "data_attributes_email",
    "data_attributes_receiveDate",
    "data_attributes_postedDate",
    "data_attributes_postmarkDate",
    "data_links_self",
    "data_attributes_attachments_url",
    "data_attributes_attachment_read",
    "data_attributes_duplicateComments",
    "data_attributes_withdrawn",
]





### 3.1 Process the nested JSON data

In [None]:
result = structure_data(full_data, keys_to_include)
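
The key names in `keys_to_include` follow from how the nested API response is flattened: each level of nesting is joined with an underscore. The sketch below shows the naming scheme with a simplified stand-in, `flatten_sketch`, which is a hypothetical helper written for illustration and not the `flatten_json` implementation; the record is a trimmed-down example.

```python
def flatten_sketch(obj, parent_key=""):
    """Flatten nested dicts and lists into one dict with underscore-joined keys."""
    if isinstance(obj, dict):
        children = obj.items()
    elif isinstance(obj, list):
        children = enumerate(obj)
    else:
        # Scalars end the recursion and are stored under the accumulated key
        return {parent_key: obj}
    flat = {}
    for key, value in children:
        new_key = f"{parent_key}_{key}" if parent_key else str(key)
        flat.update(flatten_sketch(value, new_key))
    return flat


# Trimmed-down record in the shape of a single comment response
example = {
    "data": {
        "id": "EPA-HQ-OLEM-2023-0278-0183",
        "attributes": {"docketId": "EPA-HQ-OLEM-2023-0278", "withdrawn": False},
    }
}
flat_example = flatten_sketch(example)
print(sorted(flat_example))
# ['data_attributes_docketId', 'data_attributes_withdrawn', 'data_id']
```

This is why, for example, the docket ID is selected with `data_attributes_docketId` and later renamed to `docket_id`.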

In [30]:
def clean_string(input_string):
    """Unescape HTML entities and strip line breaks, NULL characters, and attachment placeholders."""
    if input_string is None:
        return input_string

    # A missing value that was cast to a string shows up as the literal "nan"
    if input_string == "nan":
        return ""

    # html.unescape converts every named and numeric entity (&amp;, &#39;, &rsquo;, ...)
    input_string = html.unescape(input_string)

    # Replace the remaining markup, whitespace, and placeholder text
    replacements = {
        "<br>": " ",
        "<br/>": " ",
        "\n": " ",
        "\x00": "",
        "\xa0": " ",  # &nbsp; unescapes to a non-breaking space
        "See Attached": "",
        "See attached file(s)": "",
    }
    for old, new in replacements.items():
        input_string = input_string.replace(old, new)

    return input_string


# Create the full_text column:
for item in result:
    # Create a new column that combines the comment and the extracted text from the pdf
    item["full_text"] = item["comment"] + " " + item["comment_pdf_extracted"]
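
Comment text from the API can contain HTML entities such as `&#39;` and `&rsquo;`. Python's standard `html.unescape` converts named and numeric entities in one call. A short illustration; the first two strings are fragments of comments collected for this docket and the third is made up:

```python
import html

samples = [
    "it&#39;s about time",
    "I&rsquo;m writing to support",
    "A &amp; B",  # made-up example
]
cleaned = [html.unescape(text) for text in samples]
print(cleaned)
```

The same call is what `clean_string` relies on, so applying `clean_string` to `item["full_text"]` would be one way to produce a cleaned version of the column.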

### 3.2 Structure the data into a CSV file

In [34]:
df = pd.DataFrame(result)

In [35]:
df.to_csv(f"{docket_id}_comments.csv", index=False)

Unnamed: 0,comment_id,document_id,docket_id,agency_id,title,comment,comment_pdf_extracted,commenter_first_name,commenter_last_name,commenter_organization,...,commenter_email,receive_date,posted_date,postmark_date,duplicate_comments,attachment_read,attachment_url,withdrawn,api_url,full_text
0,EPA-HQ-OLEM-2023-0278-0183,EPA-HQ-OLEM-2023-0278-0001,EPA-HQ-OLEM-2023-0278,EPA,Anonymous public comment,I strongly support and encourage the addition ...,\n \nNTP\nNational Toxicology Program\nU.S. De...,,,,...,,2024-02-10T05:00:00Z,2024-02-13T05:00:00Z,2024-02-10T05:00:00Z,1,attachment extracted,https://downloads.regulations.gov/EPA-HQ-OLEM-...,False,https://api.regulations.gov/v4/comments/EPA-HQ...,I strongly support and encourage the addition ...
1,EPA-HQ-OLEM-2023-0278-0180,EPA-HQ-OLEM-2023-0278-0001,EPA-HQ-OLEM-2023-0278,EPA,Anonymous public comment,My company is small (appx. 180 employees) and ...,,,,,...,,2024-02-08T05:00:00Z,2024-02-09T05:00:00Z,2024-02-08T05:00:00Z,1,no attachment,,False,https://api.regulations.gov/v4/comments/EPA-HQ...,My company is small (appx. 180 employees) and ...
2,EPA-HQ-OLEM-2023-0278-0182,EPA-HQ-OLEM-2023-0278-0001,EPA-HQ-OLEM-2023-0278,EPA,Comment submitted by Bernard Wlodarski,it&#39;s about time these chemicals were banne...,,,,,...,,2024-02-09T05:00:00Z,2024-02-09T05:00:00Z,2024-02-09T05:00:00Z,1,no attachment,,False,https://api.regulations.gov/v4/comments/EPA-HQ...,it&#39;s about time these chemicals were banne...
3,EPA-HQ-OLEM-2023-0278-0181,EPA-HQ-OLEM-2023-0278-0001,EPA-HQ-OLEM-2023-0278,EPA,Comment submitted by Delaney Moran,This law proposes the addition of nine Per- an...,,,,,...,,2024-02-08T05:00:00Z,2024-02-09T05:00:00Z,2024-02-08T05:00:00Z,1,no attachment,,False,https://api.regulations.gov/v4/comments/EPA-HQ...,This law proposes the addition of nine Per- an...
4,EPA-HQ-OLEM-2023-0278-0184,EPA-HQ-OLEM-2023-0278-0001,EPA-HQ-OLEM-2023-0278,EPA,Anonymous public comment,PFAS must be included in the List of Hazardous...,,,,,...,,2024-02-11T05:00:00Z,2024-02-15T05:00:00Z,2024-02-11T05:00:00Z,1,no attachment,,False,https://api.regulations.gov/v4/comments/EPA-HQ...,PFAS must be included in the List of Hazardous...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,EPA-HQ-OLEM-2023-0278-0286,EPA-HQ-OLEM-2023-0278-0001,EPA-HQ-OLEM-2023-0278,EPA,Comment submitted by State of Utah Division of...,See Attached,\n \nDepartment of\nEnvironmental Quality\nKim...,,,,...,,2024-04-02T04:00:00Z,2024-04-12T04:00:00Z,2024-03-21T04:00:00Z,1,attachment extracted,https://downloads.regulations.gov/EPA-HQ-OLEM-...,False,https://api.regulations.gov/v4/comments/EPA-HQ...,See Attached \n \nDepartment of\nEnvironmental...
98,EPA-HQ-OLEM-2023-0278-0288,EPA-HQ-OLEM-2023-0278-0001,EPA-HQ-OLEM-2023-0278,EPA,Mass Comment Campaign sponsored by U.S. PIRG (...,<br/><br/>Re: Listing of Specific Per- and Pol...,\n \nFirst Name\nLast Name\nCity\nState\nZIP C...,,,,...,,2024-04-08T04:00:00Z,2024-04-15T04:00:00Z,2024-04-08T04:00:00Z,10510,attachment extracted,https://downloads.regulations.gov/EPA-HQ-OLEM-...,False,https://api.regulations.gov/v4/comments/EPA-HQ...,<br/><br/>Re: Listing of Specific Per- and Pol...
99,EPA-HQ-OLEM-2023-0278-0287,EPA-HQ-OLEM-2023-0278-0001,EPA-HQ-OLEM-2023-0278,EPA,Mass Comment Campaign sponsoring organization ...,I&rsquo;m writing to support the U.S. Environm...,,,,,...,,2024-03-27T04:00:00Z,2024-04-15T04:00:00Z,2024-03-27T04:00:00Z,1,no attachment,,False,https://api.regulations.gov/v4/comments/EPA-HQ...,I&rsquo;m writing to support the U.S. Environm...
100,EPA-HQ-OLEM-2023-0278-0290,EPA-HQ-OLEM-2023-0278-0001,EPA-HQ-OLEM-2023-0278,EPA,Comment submitted by National PFAS Contaminati...,Please find the attached appendices containing...,\n \nAnn Occup Environ Med. 2023 Mar 15;35:e5\...,,,,...,,2024-04-08T04:00:00Z,2024-04-15T04:00:00Z,2024-04-08T04:00:00Z,1,attachment extracted,https://downloads.regulations.gov/EPA-HQ-OLEM-...,False,https://api.regulations.gov/v4/comments/EPA-HQ...,Please find the attached appendices containing...
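
When the saved CSV is read back with pandas, empty cells are converted to `NaN` by default, which is one way the string "nan" can end up in text columns. A minimal sketch of reading the file so that empty cells stay empty strings; the two-row CSV is made up and only shows three of the columns:

```python
import io

import pandas as pd

# Made-up two-row CSV in the same shape as the saved file (three columns only)
csv_text = (
    "comment_id,comment,attachment_read\n"
    "ID-0001,,no attachment\n"
    "ID-0002,See Attached,attachment extracted\n"
)

# keep_default_na=False keeps empty cells as "" instead of converting them to NaN
df_check = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)
print(df_check["attachment_read"].value_counts())
```

For the real file, `io.StringIO(csv_text)` would be replaced with the path written by `df.to_csv`.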
