# Load SEC EDGAR Quarterly (10-Q) XBRL

## Objective

Navigate the EDGAR XBRL indices and download the 10-Q financial statements.

## Background
There are multiple ways to reach the target 10-Q file for a given year/quarter. One way is to use the master index CSV file for the year/quarter, identified by the following URL.

```https://www.sec.gov/Archives/edgar/full-index/${YEAR}/${QTR}/xbrl.gz```
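
A minimal sketch of building that URL for a given year and quarter. It assumes `${QTR}` takes the form `QTR1` to `QTR4`; the helper name `xbrl_index_url` is hypothetical and not part of this notebook.

```python
EDGAR_FULL_INDEX = "https://www.sec.gov/Archives/edgar/full-index"


def xbrl_index_url(year: int, quarter: int) -> str:
    """Return the URL of the quarterly XBRL index for (year, quarter)."""
    # Assumption: the quarter directory is named QTR1 .. QTR4.
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"quarter must be 1-4, got {quarter}")
    return f"{EDGAR_FULL_INDEX}/{year}/QTR{quarter}/xbrl.gz"


print(xbrl_index_url(2014, 1))
```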

Each row in the master CSV tells where the filing for a specific company/CIK is located for the (year, quarter). The sample row below happens to be a 10-K; 10-Q rows have the same layout.

| CIK    | Company Name | Form Type          | Date Filed | Filename   |                                           
|--------|--------------|--------------------|------------|------------|
|1047127|AMKOR TECHNOLOGY INC|10-K|2014-02-28|edgar/data/1047127/0001047127-14-000006.txt|


## XBRL file

Use the XBRL file with the ```_htm.xml``` suffix instead of the ```.txt``` file. Find the ```_htm.xml``` XBRL file in the EDGAR directory listing at ```https://www.sec.gov/Archives/edgar/data/${CIK}/${ACCESSION}/index.xml```.
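
The lookup can be sketched with the standard library alone. The sample listing below is made up for illustration (hypothetical CIK, accession number and file names), and its layout is an assumption: each file is taken to appear as an `<href>` element holding its path. The helper name `find_htm_xml` is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical directory listing; real index.xml files may carry more fields.
SAMPLE_INDEX_XML = """
<directory>
  <name>/Archives/edgar/data/1234567/000123456720000001</name>
  <item>
    <name type="file">R1.htm</name>
    <href>/Archives/edgar/data/1234567/000123456720000001/R1.htm</href>
  </item>
  <item>
    <name type="file">example-20200331_htm.xml</name>
    <href>/Archives/edgar/data/1234567/000123456720000001/example-20200331_htm.xml</href>
  </item>
</directory>
"""


def find_htm_xml(index_xml: str):
    """Return the URL of the first *_htm.xml entry in a listing, or None."""
    root = ET.fromstring(index_xml.strip())
    for href in root.iter("href"):
        path = (href.text or "").strip()
        if path.endswith("_htm.xml"):
            return "https://www.sec.gov" + path
    return None


print(find_htm_xml(SAMPLE_INDEX_XML))
```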


## URL
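The URL to the 10-Q index.html for the ```(CIK, year, quarter)``` is ```https://sec.gov/Archives/edgar/data/${CIK}/${path}/index.html```

To get ```path```, take the file name at the end of the **Filename** field value and remove the ```-``` (hyphen) characters and the ```.txt``` suffix. For example, ```0001047127-14-000006.txt``` becomes ```000104712714000006```.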
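
As a rough illustration, the conversion from a **Filename** value to the index.html URL might look like the following. The helper name `filing_index_url` is hypothetical; the input is the sample row shown in the Background table.

```python
def filing_index_url(filename: str) -> str:
    """Convert an index 'Filename' value into the filing's index.html URL."""
    # Split "edgar/data/{CIK}/{accession}.txt" into directory and document name.
    directory, _, document = filename.strip().rpartition("/")
    if not document.endswith(".txt"):
        raise ValueError(f"expected a .txt filename, got {filename!r}")
    # Drop the ".txt" suffix, then the hyphens in the accession number.
    path = document[: -len(".txt")].replace("-", "")
    return f"https://sec.gov/Archives/{directory}/{path}/index.html"


print(filing_index_url("edgar/data/1047127/0001047127-14-000006.txt"))
```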

# Setup

In [1]:
%%html
<style>
table {float:left}
</style>

In [2]:
!pip install python-Levenshtein 



In [3]:
import logging
import os
import glob
import re
import json
import time
import requests

from bs4 import BeautifulSoup
import pandas as pd
import Levenshtein as levenshtein 

In [4]:
pd.set_option('display.max_colwidth', None)

logging.basicConfig(level=logging.ERROR)
Logger = logging.getLogger(__name__)

---
# EDGAR

* Investopedia - [Where Can I Find a Company's Annual Report and Its SEC Filings?](https://www.investopedia.com/ask/answers/119.asp)

> If you want to dig deeper and go beyond the slick marketing version of the annual report found on corporate websites, you'll have to search through required filings made to the Securities and Exchange Commission. All publicly-traded companies in the U.S. must file regular financial reports with the SEC. These filings include the annual report (known as the 10-K), quarterly report (10-Q), and a myriad of other forms containing all types of financial data.

# Quarterly filing indices

* [Accessing EDGAR Data](https://www.sec.gov/os/accessing-edgar-data)

> Using the EDGAR index files  
Indexes to all public filings are available from 1994Q3 through the present and located in the following browsable directories:
> * https://www.sec.gov/Archives/edgar/daily-index/ — daily index files through the current year; (**DO NOT forget the trailing slash '/'**)
> * https://www.sec.gov/Archives/edgar/full-index/ — full indexes offer a "bridge" between quarterly and daily indexes, compiling filings from the beginning of the current quarter through the previous business day. At the end of the quarter, the full index is rolled into a static quarterly index.
> 
> Each directory and all child sub directories contain three files to assist in automated crawling of these directories. Note that these are not visible through directory browsing.
> * index.html (the web browser would normally receive these)
> * index.xml (an XML structured version of the same content)
> * index.json (a JSON structured vision of the same content)
> 
> Four types of indexes are available:
> * company — sorted by company name
> * form — sorted by form type
> * master — sorted by CIK number 
> * **XBRL** — list of submissions containing XBRL financial files, sorted by CIK number; these include Voluntary Filer Program submissions
> 
> The EDGAR indexes list the following information for each filing:
> * company name
> * form type
> * central index key (CIK)
> * date filed
> * file name (including folder path)

## Example

Full index files for 2006 QTR 3.
<img src="../image/edgar_full_index_quarter_2006QTR3.png" align="left" width="800"/>
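
The five index fields quoted above can be read with the standard `csv` module. This is only a sketch: it assumes the rows are pipe-delimited with a single header line, and that any descriptive lines the raw index file may carry before the header have already been removed. The second sample row is hypothetical, and so is the helper name `rows_for_forms`.

```python
import csv
import io

# First data row is the sample from the Background table; the second is made up.
SAMPLE_INDEX = """CIK|Company Name|Form Type|Date Filed|Filename
1047127|AMKOR TECHNOLOGY INC|10-K|2014-02-28|edgar/data/1047127/0001047127-14-000006.txt
1234567|EXAMPLE CORP|10-Q|2014-03-15|edgar/data/1234567/0001234567-14-000001.txt
"""


def rows_for_forms(text: str, form_types=("10-Q",)):
    """Return the index rows (as dicts) whose 'Form Type' is in form_types."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return [row for row in reader if row["Form Type"] in form_types]


for row in rows_for_forms(SAMPLE_INDEX):
    print(row["CIK"], row["Company Name"], row["Filename"])
```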

## Constant

In [5]:
EDGAR_BASE_URL = "https://sec.gov/Archives"
EDGAR_HTTP_HEADERS = {"User-Agent": "Company Name myname@company.com"}

DATA_DIR = "../data/listings/XBRL"

# XBRL Listing Handler

In [6]:
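def index_xml_url(filename):
    """Generate the EDGAR directory listing index.xml URL.
    https://www.sec.gov/Archives/edgar/data/{CIK}/{ACCESSION}/index.xml
    """
    # str.rstrip(".txt") strips any trailing '.', 't' or 'x' characters rather than
    # the suffix, so remove the ".txt" suffix explicitly before dropping the hyphens.
    path = re.sub(r"\.txt$", "", filename).replace('-', '')
    url = "/".join([
        EDGAR_BASE_URL,
        path,
        "index.xml"
    ])
    return url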


def edgar_xbrl_listing_file_datafarme(data_dir=DATA_DIR, types=['10-K', '10-Q']):
    """
    Generate a pandas dataframe for each XBRL listing file, in alphabetical order of file name.
    The 'Filename' is replaced with the EDGAR directory index.xml listing URL.
    
    Args:
        data_dir: directory where XBRL files are located.
        types: Form types
    Yields:
        (filename, pandas df) for each listing file
    """
    files = sorted(filter( os.path.isfile, glob.glob(data_dir + os.sep + "*") ) )
    for filepath in files:
        filename = os.path.basename(filepath)
        df = pd.read_csv(
            filepath,
            skip_blank_lines=True,
            header=0,         # The 1st data line after omitting skiprows and blank lines.
            sep='|',
            parse_dates=['Date Filed'],
        )
        
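        # Select rows for target filing types. Copy so that the 'Filename' assignment
        # below modifies an independent frame rather than a slice of the original.
        df = df.loc[df['Form Type'].isin(types)].copy() if types else df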
        
        # Set the index.xml URL as the filename
        df['Filename'] = df['Filename'].apply(index_xml_url)

        yield (filename, df)

In [7]:
filename, df = next(edgar_xbrl_listing_file_datafarme())
df.head(5)

| Unnamed: 0 | CIK | Company Name | Form Type | Date Filed | Filename |
|------------|-----|--------------|-----------|------------|----------|
| 0 | 1000697 | WATERS CORP /DE/ | 10-K | 2010-02-26 | https://sec.gov/Archives/edgar/data/1000697/000095012310017583/index.xml |
| 1 | 1001039 | WALT DISNEY CO/ | 10-Q | 2010-02-09 | https://sec.gov/Archives/edgar/data/1001039/000119312510025949/index.xml |
| 3 | 1001082 | DISH Network CORP | 10-K | 2010-03-01 | https://sec.gov/Archives/edgar/data/1001082/000095012310018671/index.xml |
| 4 | 1001838 | SOUTHERN COPPER CORP/ | 10-K | 2010-02-26 | https://sec.gov/Archives/edgar/data/1001838/000110465910010334/index.xml |
| 5 | 1002638 | OPEN TEXT CORP | 10-Q | 2010-02-04 | https://sec.gov/Archives/edgar/data/1002638/000119312510021715/index.xml |


# Identify the XBRL URLs

In [11]:
def xbrl_url(index_xml_url: str):
    """Generate the URL to the XBRL file in the filing directory.
    Args:
        index_xml_url: 
            URL to the EDGAR directory listing index.xml file whose format is e.g.:
            "https://sec.gov/Archives/edgar/data/62996/000095012310013437/index.xml"
    Returns:
        URL to the XBRL file in the filing directory.
    """
    index_xml_url = index_xml_url.strip()
    Logger.debug("Identifying XBRL URLs for the listing [%s]" % index_xml_url)

    # --------------------------------------------------------------------------------
    # Filing directory path whose format is "/Archives/edgar/data/{CIK}/{ACCESSION}/".
    # Remove "https://[www.]sec.gov" and "index.xml" from the index_xml_url.
    # --------------------------------------------------------------------------------
    pattern = r"(http|https)://(www\.sec\.gov|sec\.gov)(.*/)index\.xml"
    match = re.search(pattern, index_xml_url, re.IGNORECASE)
    assert match and re.match(r"^/Archives/edgar/data/[0-9]*/[0-9]*/", match.group(3)), \
        f"No matching path found by regexp [{pattern}], but got {match}"

    directory = match.group(3)
    Logger.debug("Filing directory path is [%s]" % directory)
    
    # --------------------------------------------------------------------------------
    # GET the index.xml
    # --------------------------------------------------------------------------------
    response = requests.get(index_xml_url, headers=EDGAR_HTTP_HEADERS)
    if response.status_code == 200:
        content = response.content.decode("utf-8") 
    else:
        Logger.error("%s failed with %s" % (index_xml_url, response.status_code))
        assert False, f"{index_xml_url} failed with status {response.status_code}"
    
    # --------------------------------------------------------------------------------
    # Look for the XBRL XML file in the index.xml.
    # 1. _htm.xml file
    # 2. <filename>.xml where "filename" is from <filename>.xsd.
    # 3. <filename>.xml where "filename" is not RNN.xml e.g. R10.xml.
    # --------------------------------------------------------------------------------
    # 1. Look for _htm.xml.
    index = BeautifulSoup(content, 'html.parser')
    # print(index.prettify())
    
    path_to_xbrl = index.find('href', string=re.compile(r".*_htm\.xml"))
    if path_to_xbrl:
        url = "https://sec.gov" + path_to_xbrl.string.strip()
        Logger.debug("URL to XBRL is [%s]" % url)
        return url
    else:
        Logger.warning(f"No XBRL with the .*_htm.xml pattern in the listing {index_xml_url}")

    # 2. Look for XML file for the corresponding XSD.
    path_to_xsd = index.find('href', string=re.compile(re.escape(directory) + r".*\.xsd"))
    if path_to_xsd:
        # Extract filename from "/Archives/edgar/data/{CIK}/{ACCESSION}/<filename>.xsd".
        pattern = re.escape(directory) + r"(.*)\.xsd"
        path_to_xsd = path_to_xsd.string.strip()
        match = re.search(pattern, path_to_xsd, re.IGNORECASE)
        assert match and match.group(1), f"No filename match with {pattern}"
        
        # Filename of the XSD
        filename = match.group(1)

        # Iterate over all .xml files and find the distance from the XSD filename.
        distance = 999
        candidate = None
        for href in index.find_all('href', string=re.compile(re.escape(directory) + r".*\.xml")):
            pattern = re.escape(directory) + r"(.*)\.xml"
            match = re.search(pattern, href.string.strip(), re.IGNORECASE)
            assert match and match.group(1), f"[{href}] has no .xml with {pattern}"

            potential = match.group(1)
            new_distance = levenshtein.distance(filename, potential)
            if new_distance < distance:
                distance = new_distance
                candidate = potential
                Logger.debug(
                    "Candidate [%s] is picked with the distance from [%s] is [%s]." 
                    % (candidate, filename, distance) 
                )

        if distance < 3:  # Accept within 2-distance away from the XSD filename.
            path_to_xml = directory + candidate + ".xml"
            url = "https://sec.gov" + path_to_xml
            Logger.debug(
                "Selected the candidate [%s] of distance [%s]. \nURL to XBRL is [%s]" 
                % (candidate, distance, url)
            )
            return url
        else:
            Logger.warning(
                "No corresponding XBRL found for the XSD file [%s]." % (filename + ".xsd")
            )
    else:
        Logger.error("No XSD file found in the listing [%s]." % index_xml_url)
    
    # 3. Look for XML with href="/Archives/edgar/data/{CIK}/{ACCESSION}/[^R0-9]*\.xml"
    # Regexp to match the XBRL XML file which is NOT Rnn.xml e.g. R1.xml or R10.xml. 
    # Most likely it has the format of <str>-<YEAR><MONTH><DATE>.xml.
    regexp = re.escape(directory) + r"[^R][a-zA-Z_-]*[0-9][0-9][0-9][0-9][0-9].*\.xml"
    Logger.debug("Look for XBRL XML with the regexp [%s]." % regexp)

    path_to_xbrl = index.find('href', string=re.compile(regexp))
    if path_to_xbrl:
        url = "https://sec.gov" + path_to_xbrl.string.strip()
        Logger.debug("Identified the XBRL URL [%s]." % url)
        return url
    else:
        Logger.warning("No XBRL filename matched with the regexp [%s]." % regexp)

    Logger.error("No XBRL identified in the listing [%s]" % index_xml_url)
    assert False, "No XBRL found. Check [%s] to identify the XBRL." % index_xml_url

    #time.sleep(1)

In [12]:
# Test the sample filing which has an irregular XBRL filename pattern.
xbrl_url("https://sec.gov/Archives/edgar/data/62996/000095012310013437/index.xml")

'https://sec.gov/Archives/edgar/data/62996/000095012310013437/mas-20090930.xml'
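
Step 2 of `xbrl_url` picks the `.xml` file whose name is closest to the `.xsd` file name by edit distance. The sketch below shows that idea in pure Python instead of the `Levenshtein` package, so it runs without extra installs. The function names and the candidate list are hypothetical; only `mas-20090930` comes from the test above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row (Wagner-Fischer)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def closest_name(target: str, candidates, max_distance: int = 2):
    """Return the candidate nearest to target, or None if it is too far away."""
    best = min(candidates, key=lambda name: edit_distance(target, name), default=None)
    if best is not None and edit_distance(target, best) <= max_distance:
        return best
    return None


# Hypothetical file stems from one filing directory.
stems = ["R1", "R10", "mas-20090930", "mas-20090930_lab"]
print(closest_name("mas-20090930", stems))
```

The `max_distance=2` default mirrors the `distance < 3` acceptance threshold used in `xbrl_url`.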

## Acquire the XBRL URLs

In [None]:
df['XBRL'] = df['Filename'].apply(xbrl_url)

In [None]:
df.to_csv(f"{DATA_DIR}" + "/xbrl.gz", sep="|", compression="gzip", header=True, index=False)
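
Applying `xbrl_url` to every row sends one HTTP request per filing with no pause in between. SEC publishes a fair-access request-rate limit (believed to be 10 requests per second; check the current guidance), so some spacing between calls is probably wise. The wrapper below is a minimal sketch of that idea, not part of the original notebook; `Throttle` and `throttled` are hypothetical names and the default rate is an arbitrary choice.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, max_per_second: float):
        self.min_interval = 1.0 / max_per_second
        self._last_call = None

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


def throttled(func, max_per_second: float = 5.0):
    """Wrap func so that calls are spaced at least 1/max_per_second apart."""
    throttle = Throttle(max_per_second)

    def wrapper(*args, **kwargs):
        throttle.wait()
        return func(*args, **kwargs)

    return wrapper
```

With this in place, the apply step could be written as `df['Filename'].apply(throttled(xbrl_url))`.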