# Parsing SEC Filing XBRL Document


## Objective

Parse the XBRL file of a filing to create a DOM-like structure that represents the filing data.

## References

* [XBRL Specification - Extensible Business Reporting Language (XBRL) 2.1](https://www.xbrl.org/Specification/XBRL-2.1/REC-2003-12-31/XBRL-2.1-REC-2003-12-31+corrected-errata-2013-02-20.html)

* [List of US GAAP Standards](https://xbrlsite.azurewebsites.net/2019/Prototype/references/us-gaap/)
* [XBRL US - List of Elements](https://xbrl.us/data-rule/dqc_0015-le/)

**Element Version**|**Element ID**|**Namespace**|**Element Label**|**Element Name**|**Balance Type**|**Definition**
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
1|1367|us-gaap|Interest Expense|InterestExpense|debit|Amount of the cost of borrowed funds accounted for as interest expense.
2|2692|us-gaap|Cash and Cash Equivalents, at Carrying Value|CashAndCashEquivalentsAtCarryingValue|debit|Amount of currency on hand as well as demand deposits with banks or financial institutions. Includes other kinds of accounts that have the general characteristics of demand deposits. Also includes short-term, highly liquid investments that are both readily convertible to known amounts of cash and so near their maturity that they present insignificant risk of changes in value because of changes in interest rates. Excludes cash and cash equivalents within disposal group and discontinued operation.

## XBRL Element

* [Understanding the Financial Report Logical System](https://www.youtube.com/playlist?list=PLqMZRUzQ64B7EWamzDP-WaYbS_W0RL9nt)
* [XBRL - What is us-gaap:OperatingSegmentsMember element anb where is it defined?](https://money.stackexchange.com/questions/148010/xbrl-what-is-us-gaapoperatingsegmentsmember-element-anb-where-is-it-defined)

### Example
For instance, the Qorvo 2020 10-K reports cash and cash equivalents in the following elements:

* [XBRL/rfmd-20210403_htm.xml](https://www.sec.gov/Archives/edgar/data/1604778/000160477821000032/rfmd-20210403_htm.xml)
* [HTML/rfmd-20210403.htm](https://www.sec.gov/Archives/edgar/data/1604778/000160477821000032/rfmd-20210403.htm)

```
<us-gaap:cashandcashequivalentsatcarryingvalue contextref="*" decimals="-3" id="..." unitref="usd">
  1397880000
</us-gaap:cashandcashequivalentsatcarryingvalue>,
<us-gaap:cashandcashequivalentsatcarryingvalue contextref="***" decimals="-3" id="..." unitref="usd">
  714939000
</us-gaap:cashandcashequivalentsatcarryingvalue>,
<us-gaap:cashandcashequivalentsatcarryingvalue contextref="***" decimals="-3" id="..." unitref="usd">
 711035000
</us-gaap:cashandcashequivalentsatcarryingvalue>
```

These elements correspond to the Cash and Cash Equivalents line in the Cash Flow statement.

<img src="../image/edgar_qorvo_2020_10K_CF.png" align="left" width=800 />
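
As a minimal sketch of how such facts can be read, the snippet below parses a small hand-written XBRL fragment with the standard-library `xml.etree.ElementTree` (a swap-in for the BeautifulSoup parser used in this notebook, so that it runs without extra packages). The `us-gaap` namespace URI and the context ids `ctx_2021` and `ctx_2020` are assumptions for illustration; only the two values come from the example above. Note that the original XML uses camel-case names such as `contextRef`, whereas the lowercased names shown above are what the HTML parser produces.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map. The us-gaap URI is an assumption; take the real one from the filing.
NS = {"us-gaap": "http://fasb.org/us-gaap/2020-01-31"}

# Hand-written fragment; the context ids are hypothetical.
SAMPLE = """
<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:us-gaap="http://fasb.org/us-gaap/2020-01-31">
  <us-gaap:CashAndCashEquivalentsAtCarryingValue
      contextRef="ctx_2021" decimals="-3" unitRef="usd">1397880000</us-gaap:CashAndCashEquivalentsAtCarryingValue>
  <us-gaap:CashAndCashEquivalentsAtCarryingValue
      contextRef="ctx_2020" decimals="-3" unitRef="usd">714939000</us-gaap:CashAndCashEquivalentsAtCarryingValue>
</xbrl>
"""


def read_facts(xml_text, qname):
    """Return (contextRef, unitRef, decimals, value) tuples for one element name."""
    root = ET.fromstring(xml_text)
    return [
        (e.get("contextRef"), e.get("unitRef"), int(e.get("decimals")), float(e.text))
        for e in root.findall(qname, NS)  # direct children of the root only
    ]


cash_facts = read_facts(SAMPLE, "us-gaap:CashAndCashEquivalentsAtCarryingValue")
for fact in cash_facts:
    print(fact)
```

The same element name appears once per context, which is why the reporting-period context has to be identified before a single value can be picked.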

---
# Setup

In [1]:
from typing import (
    List,
    Dict
)
import re
import requests
import unicodedata
import bs4
from bs4 import BeautifulSoup
from IPython.display import (
    display, 
    HTML
)

import numpy as np
import pandas as pd
pd.set_option('display.float_format', lambda x: ('%f' % x).rstrip('0').rstrip('.'))
pd.set_option('display.colheader_justify', 'center')

In [2]:
%%html
<style>
table {float:left}
</style>

In [3]:
def restore_windows_1252_characters(restore_string):
    """
        Replace C1 control characters in the Unicode string restore_string by the
        characters at the corresponding code points in Windows-1252,
        where possible.
    """

    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
        
    # C1 control characters span U+0080 to U+009F.
    return re.sub(r'[\u0080-\u009f]', to_windows_1252, restore_string)

---
# Load EDGAR Filing XBRL

Download the ```_htm.xml``` file from EDGAR. The SEC requires a ```User-Agent``` header that identifies the requester.

In [4]:
# Define the URL of the XBRL instance (_htm.xml) file of the filing
CIK = '1604778'
ACCESSION = '000160477821000032'

FILING_DIR_URL = f"https://www.sec.gov/Archives/edgar/data/{CIK}/{ACCESSION}"
XBRL_NAME = "rfmd-20210403_htm.xml"
XBRL_URL = "/".join([FILING_DIR_URL, XBRL_NAME])

XBRL_URL

'https://www.sec.gov/Archives/edgar/data/1604778/000160477821000032/rfmd-20210403_htm.xml'

In [5]:
headers = {"User-Agent": "Company Name myname@company.com"}
response = requests.get(XBRL_URL, headers=headers)

# Fail fast on a non-200 response; otherwise `content` would be undefined in the next cell.
if response.status_code != 200:
    raise RuntimeError(f"{XBRL_URL} failed with status {response.status_code}")
content = response.content.decode("utf-8")

In [6]:
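# html.parser lowercases tag and attribute names, so the lookups below
# use lowercase names (e.g. 'enddate', 'contextref').
soup = BeautifulSoup(content, 'html.parser')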

## Reporting period

Each 10-K and 10-Q XBRL document declares the reporting period of the filing in a ```context``` element. To exclude other periods, e.g. the previous year or quarter, select the facts by the ```context id``` of the reporting period.

For instance:

### QRVO 10-K 2020

```
<context id="ifb6ce67cf6954ebf88471dd82daa9247_D20200329-20210403">
    <entity>
    <identifier scheme="http://www.sec.gov/CIK">0001604778</identifier>
    </entity>
    <period>
        <startDate>2020-03-29</startDate>
        <endDate>2021-04-03</endDate>
    </period>
</context>
```

### AMKR 10-K 2020

```
<context id="i5fac0a392353427b8266f185495754d3_D20200101-20201231">
    <entity>
    <identifier scheme="http://www.sec.gov/CIK">0001047127</identifier>
    </entity>
    <period>
        <startDate>2020-01-01</startDate>
        <endDate>2020-12-31</endDate>
    </period>
</context>
```

### AAPL 10-Q, calendar 4th QTR 2020 (Apple fiscal Q1 2021)

```
<context id="i6e431846933d461fb8c8c0bdf98c9758_D20200927-20201226">
    <entity>
    <identifier scheme="http://www.sec.gov/CIK">0000320193</identifier>
    </entity>
    <period>
        <startDate>2020-09-27</startDate>
        <endDate>2020-12-26</endDate>
    </period>
</context>
```

In [7]:
CONTEXT = None
for context in soup.find_all('context'):
    if context.find('period') and context.find('period').find('enddate'):
        CONTEXT = context
        break        
        
assert CONTEXT is not None, "Report period not found"

cik_value_in_statement = int(CONTEXT.find('identifier').text.strip())
assert int(CIK) == cik_value_in_statement, \
    "The CIK %s in statement does not match %s" % (cik_value_in_statement, int(CIK))

CONTEXT_ID = CONTEXT['id']
PERIOD = CONTEXT.find('period').find('enddate').text.strip()

print(f"PERIOD is {PERIOD}. CONTEXT_ID is {CONTEXT_ID}")

PERIOD is 2021-04-03. CONTEXT_ID is ifb6ce67cf6954ebf88471dd82daa9247_D20200329-20210403
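
The lookup above can be sketched without BeautifulSoup as follows. This is a minimal illustration using the standard-library `xml.etree.ElementTree`, assuming, as the cell above does, that the first context with an `endDate` is the reporting period. The function name `find_reporting_context` and the second (instant) context id are hypothetical; the first context is the QRVO example shown earlier.

```python
import xml.etree.ElementTree as ET

XBRLI = {"xbrli": "http://www.xbrl.org/2003/instance"}

SAMPLE = """
<xbrl xmlns="http://www.xbrl.org/2003/instance">
  <context id="ifb6ce67cf6954ebf88471dd82daa9247_D20200329-20210403">
    <entity><identifier scheme="http://www.sec.gov/CIK">0001604778</identifier></entity>
    <period><startDate>2020-03-29</startDate><endDate>2021-04-03</endDate></period>
  </context>
  <context id="hypothetical_instant_I20210403">
    <entity><identifier scheme="http://www.sec.gov/CIK">0001604778</identifier></entity>
    <period><instant>2021-04-03</instant></period>
  </context>
</xbrl>
"""


def find_reporting_context(xml_text):
    """Return (context id, CIK, end date) of the first duration context, or None."""
    root = ET.fromstring(xml_text)
    for context in root.iterfind("xbrli:context", XBRLI):
        end_date = context.find("xbrli:period/xbrli:endDate", XBRLI)
        if end_date is not None:
            cik = context.find("xbrli:entity/xbrli:identifier", XBRLI).text.strip()
            return context.get("id"), int(cik), end_date.text.strip()
    return None


context_id, cik, period = find_reporting_context(SAMPLE)
print(context_id, cik, period)
```

Relying on document order is a simplification: it works for the filings listed here, but a filing that declares a prior-period context first would need an explicit check of the dates.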


### Regexp to find all the contexts that match with PERIOD

In [8]:
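# Ids of the instant contexts (e.g. balance sheet dates) whose date equals PERIOD,
# plus the duration context of the reporting period.
context_ids = [CONTEXT_ID] + [
    context['id']
    for context in soup.find_all('context') 
    if context.find('period') and context.find('period').find(['instant'], string=PERIOD)
]
# Escape each id and anchor the pattern so that only exact context ids match.
contexts_regexp = "^(?:" + "|".join(map(re.escape, context_ids)) + ")$"
CONTEXT_REGEXP = re.compile(contexts_regexp)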
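
When several ids are joined with `|`, two details are easy to miss: an empty alternative (for example a trailing `|` when no instant context is found) matches every string, and an unanchored pattern also matches longer ids that merely contain one of the listed ids. The sketch below shows one way to guard against both; `build_context_pattern` is a hypothetical helper, and the two ids are the ones that appear in this notebook.

```python
import re


def build_context_pattern(context_ids):
    """Compile a pattern that matches any one of the given ids exactly."""
    if not context_ids:
        raise ValueError("at least one context id is required")
    # re.escape protects characters such as '-' or '.' inside the ids.
    alternation = "|".join(re.escape(context_id) for context_id in context_ids)
    # ^(?: ... )$ anchors the whole alternation, not just its first or last branch.
    return re.compile(rf"^(?:{alternation})$")


pattern = build_context_pattern([
    "ifb6ce67cf6954ebf88471dd82daa9247_D20200329-20210403",
    "i531402faf1d04969ac2b2ba0e1680766_I20210403",
])

exact = bool(pattern.search("i531402faf1d04969ac2b2ba0e1680766_I20210403"))
longer = bool(pattern.search("i531402faf1d04969ac2b2ba0e1680766_I20210403_extra"))
print(exact, longer)
```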

In [9]:
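# Regexp to match a numeric string. It is anchored and requires at least one digit,
# so it matches neither the empty string nor non-numeric text.
REGEXP_NUMERIC = re.compile(r"^\s*-?\d+(?:\.\d+)?\s*$")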

re.match(REGEXP_NUMERIC, " -123.12 ")

<re.Match object; span=(0, 9), match=' -123.12 '>
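
A pattern in which every part is optional, such as `\s*[\d.-]*\s*`, also matches the empty string, so `re.match` succeeds on any input and cannot be used to validate a value. The sketch below compares such a pattern with an anchored one; the names `LOOSE` and `STRICT` are for illustration only.

```python
import re

LOOSE = re.compile(r"\s*[\d.-]*\s*")              # every part is optional
STRICT = re.compile(r"^\s*-?\d+(?:\.\d+)?\s*$")   # needs at least one digit, anchored

for text in [" -123.12 ", "abc", ""]:
    # bool(match) is True whenever a match object is returned, even a zero-length one.
    print(repr(text), bool(LOOSE.match(text)), bool(STRICT.match(text)))
```

Only the first input is accepted by the anchored pattern, while the loose pattern accepts all three.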

# Constants

In [10]:
# XBRL Namespace
NAMESPACE = "us-gaap"

# XBRL attribute conditions to match when extracting FS elements
ATTRIBUTES = {
    "contextref": CONTEXT_REGEXP,
    "decimals": True, 
    "unitref": True
}

In [11]:
CREDIT_ITEMS = set(map(str.lower, [
    f"{NAMESPACE}:Revenues",
    f"{NAMESPACE}:GrossProfit",
]))

DEBIT_ITEMS = set(map(str.lower, [
    f"{NAMESPACE}:CostOfRevenue",
    f"{NAMESPACE}:CostOfGoods",
    f"{NAMESPACE}:CostOfGoodsAndServicesSold",
    f"{NAMESPACE}:OperatingExpenses",
    f"{NAMESPACE}:ResearchAndDevelopmentExpense",
    f"{NAMESPACE}:SellingGeneralAndAdministrativeExpense",
    f"{NAMESPACE}:OtherCostAndExpenseOperating",
    f"{NAMESPACE}:IncomeTaxExpenseBenefit",
    f"{NAMESPACE}:InterestExpense",
]))

# Utilities

In [12]:
def find_financial_elements(soup, element_names, attributes=ATTRIBUTES):
    """Find the financial statement elements from the XML/HTML source.
    Args:
        soup: BS4 source
        element_names: String or regexp instance to select the financial elements.
        attributes: tag attributes to select the financial elements
    Returns:
        List holding the first BS4 tag that matched the element_names and attributes,
        or an empty list if nothing matched.
    """
    assert isinstance(soup, BeautifulSoup)
    assert isinstance(element_names, (re.Pattern, str))

    names = element_names.lower() if isinstance(element_names, str) else element_names

    # Only the first match is used. soup.find_all() with the same arguments
    # would return every matching fact instead of one.
    element = soup.find(
        name=names,
        string=REGEXP_NUMERIC,
        attrs=attributes
    )
    return [element] if element is not None else []

In [13]:
def get_financial_element_numeric_values(elements):
    # Also valid for an empty list, where elements[0] would raise IndexError.
    assert all(isinstance(element, bs4.element.Tag) for element in elements)
    
    values = []
    for element in elements:
        assert re.match(REGEXP_NUMERIC, element.text.strip()), f"Element must be numeric but {element.text}"
        values.append(float(element.text))
        
    return values

In [14]:
def display_items(elements):
    assert all(isinstance(element, bs4.element.Tag) for element in elements)
    # decimals="-3" means the value is accurate to the nearest thousand.
    for element in elements:
        print(f"{element.name} {element['unitref']:5} {element['decimals']:5} {element.text:15} ")
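
In XBRL 2.1 the `decimals` attribute states the accuracy of the reported number, while the fact value itself is stored unscaled; `decimals="-3"` therefore means "accurate to the nearest thousand". The sketch below converts a stored value into an "in thousands" style figure. `to_display_units` is a hypothetical helper, and treating a negative `decimals` as the presentation scale is an assumption that holds for the values shown here but is not guaranteed by the specification.

```python
def to_display_units(value, decimals):
    """Scale a stored value by 10**(-decimals) when decimals is negative."""
    scale = 10 ** max(-decimals, 0)  # non-negative decimals leave the value unchanged
    return value / scale


cash_in_thousands = to_display_units(1397880000, -3)
eps_unchanged = to_display_units(6.43, 2)
print(cash_in_thousands, eps_unchanged)
```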

In [15]:
def get_financial_element_record(elements):
    assert all(isinstance(element, bs4.element.Tag) for element in elements)
    return [
        [
            'debit' if element.name in DEBIT_ITEMS else 'credit',
            element.name, 
            element['unitref'], 
            int(element['decimals']), 
            float(element.text), 
            element['contextref']
        ]
        for element in elements
    ]

In [16]:
def get_financial_element_columns():
    return ["type", "name", "unit", "decimals", "value", "context"]

---
# Shares Outstanding

In [17]:
# Anchor each alternative so that only the exact element names match.
names = re.compile("|".join([
    rf"^{NAMESPACE}:SharesOutstanding$",
    rf"^{NAMESPACE}:CommonStockSharesOutstanding$",
    rf"^{NAMESPACE}:CommonStockOtherSharesOutstanding$",
]).lower())

shares_outstanding = find_financial_elements(soup=soup, element_names=names)
display_items(shares_outstanding)

us-gaap:commonstocksharesoutstanding shares -3    112557000       


In [18]:
df_ShareOutstanding = pd.DataFrame(
    [
        [
            element.name, 
            element['unitref'], 
            int(element['decimals']), 
            float(element.text), 
            element['contextref']
        ]
        for element in shares_outstanding
    ],
    columns=["name", "unit", "decimals", "value", "context"]
)

SHARES_OUTSTANDING = df_ShareOutstanding['value'].sum()
print(SHARES_OUTSTANDING)

df_ShareOutstanding

112557000.0


**index**|**name**|**unit**|**decimals**|**value**|**context**
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
0|us-gaap:commonstocksharesoutstanding|shares|-3|112557000|i531402faf1d04969ac2b2ba0e1680766_I20210403


---
# Statements of Income (P/L)

In [19]:
PL = []

## Revenues

In [20]:
# Anchored so that longer element names starting with "us-gaap:revenues" do not match.
revenues = find_financial_elements(soup=soup, element_names=re.compile(rf"^{NAMESPACE}:Revenues$".lower()))

PL += get_financial_element_record(revenues)

display_items(revenues)

us-gaap:revenues usd   -3    4015307000      


## Cost of Revenues

In [21]:
names = re.compile("|".join([
    rf"^{NAMESPACE}:CostOfRevenue$",
    rf"^{NAMESPACE}:CostOfGoods$",
    rf"^{NAMESPACE}:CostOfGoodsAndServicesSold$",
]).lower())

costs_of_revenues = find_financial_elements(soup=soup, element_names=names)
PL += get_financial_element_record(costs_of_revenues)

display_items(costs_of_revenues)

us-gaap:costofgoodsandservicessold usd   -3    2131741000      


## ***___# Gross Profit___***

In [22]:
names = f"{NAMESPACE}:GrossProfit"
gross_profit = find_financial_elements(soup=soup, element_names=names)

display_items(gross_profit) 

us-gaap:grossprofit usd   -3    1883566000      


## Operating Expenses

### Research and Development

In [23]:
names = f"{NAMESPACE}:ResearchAndDevelopmentExpense".lower()
operating_expense_r_and_d = find_financial_elements(soup=soup, element_names=names)
PL += get_financial_element_record(operating_expense_r_and_d)

display_items(operating_expense_r_and_d) 

us-gaap:researchanddevelopmentexpense usd   -3    570395000       


### Administrative Expense

In [24]:
names = f"{NAMESPACE}:SellingGeneralAndAdministrativeExpense"
operating_expense_administrative = find_financial_elements(soup=soup, element_names=names)
PL += get_financial_element_record(operating_expense_administrative)

display_items(operating_expense_administrative) 

us-gaap:sellinggeneralandadministrativeexpense usd   -3    367238000       


### Other operating expenses

In [25]:
names = f"{NAMESPACE}:OtherCostAndExpenseOperating"
operating_expense_other = find_financial_elements(soup=soup, element_names=names)
PL += get_financial_element_record(operating_expense_other)

display_items(operating_expense_other) 

us-gaap:othercostandexpenseoperating usd   -3    39306000        


## ***___# Total Operating Expenses___***

In [26]:
names = f"{NAMESPACE}:OperatingExpenses".lower()
operating_expense_total = find_financial_elements(soup=soup, element_names=names)

display_items(operating_expense_total) 

us-gaap:operatingexpenses usd   -3    976939000       


## ***___# Operating Income___***

$\text{Operating Income} = \text{Gross Profit} - \text{Total Operating Expenses}$

In [27]:
names = f"{NAMESPACE}:OperatingIncomeLoss"
operating_income_or_loss = find_financial_elements(soup=soup, element_names=names)

display_items(operating_income_or_loss) 

us-gaap:operatingincomeloss usd   -3    906627000       
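
The reported totals can be cross-checked from the individual items extracted above. The sketch below hard-codes the values printed in the preceding cells; the variable names are for illustration only.

```python
# Values as printed by the cells above (USD).
revenues = 4_015_307_000
cost_of_revenues = 2_131_741_000
operating_expenses = {
    "research_and_development": 570_395_000,
    "selling_general_and_administrative": 367_238_000,
    "other_operating": 39_306_000,
}

gross_profit = revenues - cost_of_revenues
total_operating_expenses = sum(operating_expenses.values())
operating_income = gross_profit - total_operating_expenses

print(gross_profit, total_operating_expenses, operating_income)
# -> 1883566000 976939000 906627000
```

All three results agree with the `GrossProfit`, `OperatingExpenses` and `OperatingIncomeLoss` elements reported in the filing.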


## Other Expenses

### Interest Expense

* [Investopedia - What Is an Interest Expense?](https://www.investopedia.com/terms/i/interestexpense.asp)

> An interest expense is the cost incurred by an entity for borrowed funds. Interest expense is a non-operating expense shown on the income statement. It represents interest payable on any borrowings – bonds, loans, convertible debt or lines of credit. It is essentially calculated as the interest rate times the outstanding principal amount of the debt. Interest expense on the income statement represents ***interest accrued during the period*** covered by the financial statements, and **NOT the amount of interest paid over that period**. While interest expense is tax-deductible for companies, in an individual's case, it depends on his or her jurisdiction and also on the loan's purpose.  
>
> For most people, mortgage interest is the single-biggest category of interest expense over their lifetimes as interest can total tens of thousands of dollars over the life of a mortgage as illustrated by online calculators.

In [28]:
names = f"{NAMESPACE}:InterestExpense"
interest_expense = find_financial_elements(soup=soup, element_names=names)
PL += get_financial_element_record(interest_expense)

display_items(interest_expense) 

us-gaap:interestexpense usd   -3    75198000        


### Other Non-operating Expenses

In [29]:
names = f"{NAMESPACE}:OtherNonoperatingIncomeExpense"
non_operating_expense = find_financial_elements(soup=soup, element_names=names)
PL += get_financial_element_record(non_operating_expense)

display_items(non_operating_expense) 

us-gaap:othernonoperatingincomeexpense usd   -3    -24049000       


## Income Tax

In [30]:
names = f"{NAMESPACE}:IncomeTaxExpenseBenefit"
income_tax_or_benefit = find_financial_elements(soup=soup, element_names=names)
PL += get_financial_element_record(income_tax_or_benefit)

display_items(income_tax_or_benefit) 

us-gaap:incometaxexpensebenefit usd   -3    73769000        


## ***___# Net Income___***

$\text{Net Income} = \text{Gross Profit} - (\text{Operating Expenses} + \text{Non-operating Expenses}) - \text{Tax}$

In [31]:
names = f"{NAMESPACE}:NetIncomeLoss"
net_income_or_loss = find_financial_elements(soup=soup, element_names=names)

display_items(net_income_or_loss) 

us-gaap:netincomeloss usd   -3    733611000       


## ***___# Net Income Per Share___***

* [US GAAP - Is Net Income Per Share the same with EPS?](https://money.stackexchange.com/questions/148015/us-gaap-is-net-income-per-share-the-same-with-eps)

In [32]:
names = f"{NAMESPACE}:EarningsPerShareBasic"
earnings_per_share_basic = find_financial_elements(soup=soup, element_names=names)

display_items(earnings_per_share_basic) 

us-gaap:earningspersharebasic usdPerShare 2     6.43            


In [33]:
NET_INCOME = sum(get_financial_element_numeric_values(net_income_or_loss))
NET_INCOME_PER_SHARE = NET_INCOME / SHARES_OUTSTANDING

NET_INCOME_PER_SHARE

6.517684373250886
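
The figure computed here (about 6.52) differs from the reported basic EPS (6.43). A likely reason is that basic EPS is calculated with the weighted-average number of shares outstanding during the period, not the share count at the period end. The sketch below backs out the share count implied by the reported EPS; it is only an approximation because the reported EPS is rounded to two decimals, and the variable names are for illustration only.

```python
net_income = 733_611_000
shares_outstanding_at_period_end = 112_557_000
reported_basic_eps = 6.43

# Net income divided by the period-end share count (the calculation in the cell above).
net_income_per_period_end_share = net_income / shares_outstanding_at_period_end

# Share count that would reproduce the reported basic EPS (approximate).
implied_weighted_average_shares = net_income / reported_basic_eps

print(round(net_income_per_period_end_share, 2))
print(round(implied_weighted_average_shares / 1e6, 1), "million shares (approximate)")
```

The implied share count is higher than the period-end count, which would be consistent with shares having been bought back during the year, although that is not verified here.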

## P/L

Is ```us-gaap:othernonoperatingincomeexpense``` a credit or a debit item? Its value is **negative**, both in the XBRL and in the Income Statement, so it should be a credit item (to be confirmed).

In [34]:
df_PL = pd.DataFrame(PL, columns=get_financial_element_columns())
df_PL

**index**|**type**|**name**|**unit**|**decimals**|**value**|**context**
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
0|credit|us-gaap:revenues|usd|-3|4015307000|ifb6ce67cf6954ebf88471dd82daa9247_D20200329-20...
1|debit|us-gaap:costofgoodsandservicessold|usd|-3|2131741000|ifb6ce67cf6954ebf88471dd82daa9247_D20200329-20...
2|debit|us-gaap:researchanddevelopmentexpense|usd|-3|570395000|ifb6ce67cf6954ebf88471dd82daa9247_D20200329-20...
3|debit|us-gaap:sellinggeneralandadministrativeexpense|usd|-3|367238000|ifb6ce67cf6954ebf88471dd82daa9247_D20200329-20...
4|debit|us-gaap:othercostandexpenseoperating|usd|-3|39306000|ifb6ce67cf6954ebf88471dd82daa9247_D20200329-20...
5|debit|us-gaap:interestexpense|usd|-3|75198000|ifb6ce67cf6954ebf88471dd82daa9247_D20200329-20...
6|credit|us-gaap:othernonoperatingincomeexpense|usd|-3|-24049000|ifb6ce67cf6954ebf88471dd82daa9247_D20200329-20...
7|debit|us-gaap:incometaxexpensebenefit|usd|-3|73769000|ifb6ce67cf6954ebf88471dd82daa9247_D20200329-20...


In [35]:
credits = df_PL[df_PL['type'] == 'credit']['value'].sum()
credits

3991258000.0

In [36]:
debits = df_PL[df_PL['type'] == 'debit']['value'].sum()
debits

3257647000.0

In [39]:
credits - debits  # Equal to the Net Income

733611000.0
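
The reconciliation above can be replayed without pandas. The sketch below copies the type and value columns of the P/L table into plain tuples (the name `PL_ROWS` is for illustration only) and confirms that credits minus debits equals the reported `NetIncomeLoss`.

```python
# (type, element name, value in USD) as listed in the P/L table above.
PL_ROWS = [
    ("credit", "us-gaap:revenues", 4_015_307_000),
    ("debit", "us-gaap:costofgoodsandservicessold", 2_131_741_000),
    ("debit", "us-gaap:researchanddevelopmentexpense", 570_395_000),
    ("debit", "us-gaap:sellinggeneralandadministrativeexpense", 367_238_000),
    ("debit", "us-gaap:othercostandexpenseoperating", 39_306_000),
    ("debit", "us-gaap:interestexpense", 75_198_000),
    ("credit", "us-gaap:othernonoperatingincomeexpense", -24_049_000),
    ("debit", "us-gaap:incometaxexpensebenefit", 73_769_000),
]

credits = sum(value for kind, _, value in PL_ROWS if kind == "credit")
debits = sum(value for kind, _, value in PL_ROWS if kind == "debit")
reported_net_income = 733_611_000

print(credits, debits, credits - debits)
# -> 3991258000 3257647000 733611000
```

The totals only reconcile when the negative `othernonoperatingincomeexpense` value is added on the credit side, which supports treating it as a credit item with a negative amount.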

---
# Balance Sheet (B/S)

## Cash & Cash Equivalents

Look for the cash and cash equivalents for the reporting period in the Balance Sheet and Cash Flow statements of the 10-K.

In [38]:
names = f"{NAMESPACE}:CashAndCashEquivalentsAtCarryingValue"

cash_equivalents = find_financial_elements(soup=soup, element_names=names)
display_items(cash_equivalents) 

us-gaap:cashandcashequivalentsatcarryingvalue usd   -3    1397880000      


---
