Query tables from plain text files #8

Open
leftmove opened this issue Apr 3, 2024 · 4 comments
Labels
good first issue (Good for newcomers)


leftmove commented Apr 3, 2024

The most recent addition to wallstreetlocal was the ability to query XML files along with HTML files. The only format left to support now is plain text (TXT).

The SEC's XML and HTML filings were barely structured enough for stocks to be queried accurately, but TXT poses an even harder challenge. The problem is inconsistency: while TXT tables can be read fairly easily by a human, they are too dissimilar from one another to query programmatically.

Here are some minified examples.

<TABLE>            <C>                                              <C>
                                            FO     RM 13F IFORMATIONTABLE
                                            VALUE  SHARES/ SH/ PUT/ INVSTMT  OTHER  VOT    ING AUTRITY
NAME OF ISSUER     TITLE OF CLASS CUSIP     (X1000)PRN AMT PRN CALL DSCRETN  MANAGERSOLE   SHARED NONE
------------------------------------------- ------------------ ---- -------  -----------------------------
AFLAC INC          COMMON STOCK   001055102       3      71SH       DEFINED       71      0      0
AGL RESOURCES INC  COMMON STOCK   001204106     123    3025SH       DEFINED     3025      0      0
ABBOTT LABS COM    COMMON STOCK   002824100    1606   30519SH       DEFINED    27798   2721      0
ABERCROMBIE & FITCHCOMMON STOCK   002896207       0       2SH       DEFINED        2      0      0
AIR PRODUCTS & CHEMCOMMON STOCK   009158106   16728  175017SH       DEFINED   140030   2282  32705
AIRGAS INC         COMMON STOCK   009363102       4      52SH       DEFINED       52      0      0
</TABLE>
<TABLE>                   <C>   <C>        <C>       <C>                <C>   <C>   <C>
                                             VALUE                       INV.  OTH   vtng
      NAME OF ISSUER      CLASS    CUSIP    (x$1000)       SHARES        disc  MGRS  AUTH

Albertson College of Idaho Large Growth
ADC TELECOMMUNICATIO      COMM  000886101         $18         337.00     Sole  N/A   Sole
AFLAC INC                 COMM  001055102         $14         298.00     Sole  N/A   Sole
AES CORP                  COMM  00130H105         $18         232.00     Sole  N/A   Sole
AXA FINL INC              COMM  002451102         $18         504.00     Sole  N/A   Sole
ABBOTT LABS               COMM  002824100         $61       1,724.00     Sole  N/A   Sole
ABERCROMBIE & FITCH       COMM  002896207          $2         114.00     Sole  N/A   Sole
</TABLE>
<TABLE>
                                                             VALUE    SHARES/ SH/ PUT/ INVSTMT            -----VOTING AUTHORITY-----
  NAME OF ISSUER                 -TITLE OF CLASS- --CUSIP-- (X$1000)  PRN AMT PRN CALL DSCRETN -MANAGERS-     SOLE   SHARED     NONE
                                 <C>                                              <C>
D DAIMLERCHRYSLER AG             ORD              D1668R123        5      112 SH       DEFINED 05              112        0        0
D DAIMLERCHRYSLER AG             ORD              D1668R123       31      748 SH       DEFINED 05              748        0        0
D DAIMLERCHRYSLER AG             ORD              D1668R123        5      130 SH       DEFINED 06              130        0        0
D DAIMLERCHRYSLER AG             ORD              D1668R123      246     5894 SH       DEFINED 14             3089        0     2805
D DAIMLERCHRYSLER AG             ORD              D1668R123      118     2832 SH       DEFINED 14             2104      604      124
D DAIMLERCHRYSLER AG             ORD              D1668R123        8      200 SH       DEFINED 29              200        0        0
D DAIMLERCHRYSLER AG             ORD              D1668R123       63     1510 SH       DEFINED 41                0        0     1510
</TABLE>

The column widths, names, and overall formatting change too much from table to table for any meaningful parser to be written. Short of writing a gargantuan amount of code, or using AI (which is expensive), there doesn't seem to be a good way to query stocks like this.

There should be a better, more effective method for taking these TXT tables and turning them into usable, structured data. If you are taking on this issue, there is no need to create pull requests here. Instead, submit some code samples that could be used, and I will transfer you to the back-end repository where you can submit your changes.
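If you want a starting point, one non-AI direction is to infer column boundaries from whitespace: a character position that is blank in every line of the table is probably a gap between columns. The helper names below are placeholders I made up for illustration, and the heuristic already breaks on rows like ABERCROMBIE & FITCHCOMMON STOCK in the first sample (where two cells run together), so treat it as a rough sketch rather than a solution.

def blank_positions(rows):
    # Character positions that are a space in every (padded) row.
    width = max(len(row) for row in rows)
    padded = [row.ljust(width) for row in rows]
    return {i for i in range(width) if all(row[i] == " " for row in padded)}


def column_spans(rows):
    # Runs of non-blank positions become (start, end) slices, i.e. candidate columns.
    blanks = blank_positions(rows)
    width = max(len(row) for row in rows)
    spans, start = [], None
    for i in range(width):
        if i not in blanks and start is None:
            start = i
        elif i in blanks and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


def parse_fixed_width_table(text):
    # Drop the <TABLE>/<C> markup and blank lines, then slice every remaining
    # row with the inferred spans. Header rows come out as ordinary rows and
    # still need to be matched up to fields like CUSIP and VALUE.
    rows = [
        line
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("<")
    ]
    if not rows:
        return []
    spans = column_spans(rows)
    return [[row[start:end].strip() for start, end in spans] for row in rows]

On the samples above this merges some columns (the name and class cells run together in the first table), so a real solution probably needs to combine several signals: the dashed separator line when one is present, the header positions, and known field names like CUSIP.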

leftmove added the good first issue label Apr 3, 2024
@ahiddenproxy

I'd be happy to help you take this on. I've faced similar data inconsistencies when building tooling for real estate projects.

leftmove commented May 3, 2024

@ahiddenproxy I have assigned you; please let me know if you need any additional context!

Thank you!

(Sorry for the late reply)

leftmove commented May 3, 2024

I should also mention that you don't need to worry about the finer details of the source code if you don't wish to.

I can point you to the function where text input is given as a table, and you can work from there on creating a function that returns structured data.

As long as some sort of querying function is created to turn the table into structured data, I can wire that function into the code so it works with everything else properly.
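By "structured data" I just mean one dictionary per holding, the same shape the existing scrapers produce. For example (values typed out by hand from the first AFLAC row above, with placeholders where the value comes from the filing metadata):

example_holding = {
    "name": "AFLAC INC",
    "ticker": "NA",  # "NA" by default, same as the other scrapers
    "class": "COMMON STOCK",
    "market_value": 3.0,  # from the VALUE (X1000) column
    "shares_held": 71.0,
    "cusip": "001055102",
    "date": "<filing report date>",  # placeholder
    "access_number": "<filing access number>",  # placeholder
}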

leftmove commented May 3, 2024

Here is the relevant code.

import json
from xml.etree import ElementTree

from bs4 import BeautifulSoup

# api, sort_rows, and process_names come from elsewhere in the repository.


def scrape_txt(cik, filing, directory, empty=False):
    # Not implemented yet -- this is the function this issue asks for.
    pass


def scrape_html(cik, filing, directory, empty=False):
    data = api.sec_directory_search(cik, directory)
    stock_soup = BeautifulSoup(data, "lxml")
    stock_table = stock_soup.find_all("table")[3]
    stock_fields = stock_table.find_all("tr")[1:3]
    stock_rows = stock_table.find_all("tr")[3:]

    (
        nameColumn,
        classColumn,
        cusipColumn,
        valueColumn,
        shrsColumn,
        multiplier,
    ) = sort_rows(stock_fields[0], stock_fields[1])

    row_stocks = {}
    report_date = filing["report_date"]
    access_number = filing["access_number"]

    for row in stock_rows:
        columns = row.find_all("td")

        if empty:
            yield None

        stock_cusip = columns[cusipColumn].text
        stock_name = columns[nameColumn].text
        stock_value = float(columns[valueColumn].text.replace(",", "")) * multiplier
        stock_shrs_amt = float(columns[shrsColumn].text.replace(",", ""))
        stock_class = columns[classColumn].text

        row_stock = row_stocks.get(stock_cusip)

        if row_stock is None:
            new_stock = {
                "name": stock_name,
                "ticker": "NA",
                "class": stock_class,
                "market_value": stock_value,
                "shares_held": stock_shrs_amt,
                "cusip": stock_cusip,
                "date": report_date,
                "access_number": access_number,
            }
        else:
            new_stock = row_stock
            new_stock["market_value"] = row_stock["market_value"] + stock_value
            new_stock["shares_held"] = row_stock["shares_held"] + stock_shrs_amt

        row_stocks[stock_cusip] = new_stock
        yield new_stock


def scrape_xml(cik, filing, directory, empty=False):
    data = api.sec_directory_search(cik, directory)
    data_str = data.decode(json.detect_encoding(data))
    tree = ElementTree.fromstring(data_str)

    info_table = {}
    namespace = {"ns": "http://www.sec.gov/edgar/document/thirteenf/informationtable"}

    report_date = filing["report_date"]
    access_number = filing["access_number"]

    for info in tree.findall("ns:infoTable", namespace):
        if empty:
            yield None

        stock_cusip = info.find("ns:cusip", namespace).text
        stock_name = info.find("ns:nameOfIssuer", namespace).text
        stock_value = float(info.find("ns:value", namespace).text.replace(",", ""))
        stock_shrs_amt = float(
            info.find("ns:shrsOrPrnAmt", namespace)
            .find("ns:sshPrnamt", namespace)
            .text.replace(",", "")
        )
        stock_class = info.find("ns:titleOfClass", namespace).text

        info_stock = info_table.get(stock_cusip)

        if info_stock is None:
            new_stock = {
                "name": stock_name,
                "ticker": "NA",
                "class": stock_class,
                "market_value": stock_value,
                "shares_held": stock_shrs_amt,
                "cusip": stock_cusip,
                "date": report_date,
                "access_number": access_number,
            }
        else:
            new_stock = info_stock
            new_stock["market_value"] = info_stock["market_value"] + stock_value
            new_stock["shares_held"] = info_stock["shares_held"] + stock_shrs_amt

        info_table[stock_cusip] = new_stock
        yield new_stock


info_table_key = ["INFORMATION TABLE", "Complete submission text file"]


def scrape_stocks(cik, data, filing, empty=False):
    index_soup = BeautifulSoup(data, "lxml")
    rows = index_soup.find_all("tr")
    directory = {"link": None, "type": None}
    for row in rows:
        items = list(map(lambda b: b.text.strip(), row))
        if any(item in items for item in info_table_key):
            link = row.find("a")
            href = link["href"]

            is_xml = href.endswith(".xml")
            is_html = "xslForm" in href
            is_txt = False  # TXT detection is not implemented yet (this issue)

            directory_type = directory["type"]
            if is_xml and not is_html:
                directory["type"] = "xml"
                directory["link"] = href
            elif is_xml and is_html and directory_type != "xml":
                directory["type"] = "html"
                directory["link"] = href
            elif is_txt and directory_type != "xml" and directory_type != "html":
                directory["type"] = "txt"
                directory["link"] = href

    link = directory["link"]
    form = directory["type"]
    if not link:
        filing_stocks = {}
        return filing_stocks

    if form == "html":
        scrape_document = scrape_html
    elif form == "xml":
        scrape_document = scrape_xml
    # elif form == "txt":
    #     scrape_document = scrape_txt

    if empty:
        row_count = 0
        for i, _ in enumerate(scrape_document(cik, filing, link, empty)):
            row_count = i
        return row_count

    update_list = [new_stock for new_stock in scrape_document(cik, filing, link)]
    updated_stocks = process_names(update_list, cik)

    filing_stocks = {}
    for new_stock in update_list:
        stock_cusip = new_stock["cusip"]
        updated_stock = updated_stocks[stock_cusip]

        updated_stock.pop("_id", None)
        new_stock.update(updated_stocks[stock_cusip])

        filing_stocks[stock_cusip] = new_stock

    return filing_stocks

The code at the link may be slightly different from what's pasted above, as some reliability edits have been made since.
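For completeness, here is roughly how I picture the TXT path slotting in once a parser exists. Everything below is a sketch built on assumptions: the ".txt" check is a guess at how those filings appear in the index, parse_fixed_width_table is the hypothetical helper from the earlier sketch, and the decoding may need to follow whatever the SEC actually serves.

def scrape_txt(cik, filing, directory, empty=False):
    # Sketch only -- the real work is turning each row of cells into a holding.
    data = api.sec_directory_search(cik, directory)
    text = data.decode("utf-8", errors="replace")

    for cells in parse_fixed_width_table(text):
        if empty:
            yield None
        # ...map `cells` onto the same dict shape scrape_html and scrape_xml
        # yield (name, ticker, class, market_value, shares_held, cusip, date,
        # access_number), which is the open problem in this issue.


# In scrape_stocks, is_txt would then key off the link itself, e.g.
#     is_txt = href.endswith(".txt")
# and the commented-out branch would assign scrape_document = scrape_txt.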
