<a href="https://colab.research.google.com/github/kondurupriyanka/AI_Financial_Assistant_BCG/blob/main/extracting_financial_10_k_reports_via_sec_edgar_db.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style="background-color: #ffd5cd; text-align: center"> Extracting Financial 10-K Statements from SEC's EDGAR Database</h1>

In this notebook, I extract 10-K filings from the U.S. *Securities and Exchange Commission's (SEC's)* EDGAR database. As an example, I extract the 10-K filings of 5 companies that belong to different sectors; together these 5 stocks make up a diversified portfolio. The notebook is a work in progress: the eventual goal is to run NLP analysis on the text of these filings. For now, it covers downloading the filings and the preprocessing steps. The notebook is organized as follows:

1. SEC EDGAR database overview
2. Fetching financial 10-K reports via SEC API
3. Download 10-K statements
4. Get documents
5. Get document types
6. Preprocessing 10-K documents
    * Parsing via BeautifulSoup
    * Lemmatization
    * Stop-words removal
7. Future work
8. References

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


<h1 style="background-color: #ffd5cd; text-align: center"> Fetching Financial 10-K Reports via SEC API</h1>

When companies file their 10-K reports with the SEC, the reports are collected in the EDGAR database, where they are publicly available for investors to search and download. To search for a company's filings, we submit an HTTPS request to the following REST URL:

<p style="text-align:center; color:blue;">https://www.sec.gov/cgi-bin/browse-edgar</p>

To specify the report we are interested in, we pass the following query parameters:

1. *CIK number (CIK)*: a unique numerical identifier assigned by the EDGAR system.
2. *Report type (type)*: the type of financial report that we wish to query, for example 10-K, 10-Q, or 8-K.
3. *Prior-to date (dateb)*: the latest filing date in which we are interested; EDGAR returns filings made before this date.
4. *The number of reports (count)*: the number of filings to return up to the prior-to date.
5. *Ownership (owner)*: the SEC requires filings from individuals who own significant amounts of a company’s stock. When the owner parameter is set to `exclude`, EDGAR omits reports related to director or officer ownership [1].

As an example, to download Nike’s annual reports filed before 2020, where Nike’s CIK number is 0000320187 and 10-K denotes the annual report type, we form the EDGAR URL as follows:

<p style="text-align:center; color:blue"> https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320187&type=10-K&dateb=20200101&count=60&owner=exclude</p>
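The query parameters listed above can be assembled programmatically rather than by hand. The following is a minimal sketch using the standard library's `urlencode`; the helper name `build_edgar_url` and the constant `EDGAR_BROWSE_URL` are hypothetical and are not used elsewhere in this notebook.

```python
from urllib.parse import urlencode

EDGAR_BROWSE_URL = "https://www.sec.gov/cgi-bin/browse-edgar"


def build_edgar_url(cik, doc_type, dateb, count=60, owner="exclude"):
    """Assemble a browse-edgar query URL from the parameters described above."""
    params = {
        "action": "getcompany",
        "CIK": cik,          # company identifier, kept as a zero-padded string
        "type": doc_type,    # report type, e.g. "10-K"
        "dateb": dateb,      # prior-to date in YYYYMMDD form
        "count": count,      # number of filings to return
        "owner": owner,      # "exclude" drops ownership filings
    }
    return f"{EDGAR_BROWSE_URL}?{urlencode(params)}"


# Rebuild the Nike example URL
nike_url = build_edgar_url("0000320187", "10-K", "20200101")
print(nike_url)
```

Keeping the CIK as a string preserves its leading zeros, which would be lost if it were stored as an integer.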




In [None]:
cik_lookup = {
    'AMZN': '0001018724',
    'JNJ': '0000200406',
    'MCD': '0000063908',
    'PEP': '0000077476',
    'WMT': '0000104169'}

In [None]:
!pip install ratelimit

Collecting ratelimit
  Downloading ratelimit-2.2.1.tar.gz (5.3 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ratelimit
  Building wheel for ratelimit (setup.py) ... done
  Created wheel for ratelimit: filename=ratelimit-2.2.1-py3-none-any.whl size=5895 sha256=80881b47966742f680a4f48ce74b817e09241b47138695ecc7ef90721d65c6af
  Stored in directory: /root/.cache/pip/wheels/27/5f/ba/e972a56dcbf5de9f2b7d2b2a710113970bd173c4dcd3d2c902
Successfully built ratelimit
Installing collected packages: ratelimit
Successfully installed ratelimit-2.2.1


In [None]:
from ratelimit import limits, sleep_and_retry
import requests

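class SecAPI(object):
    SEC_CALL_LIMIT = {'calls': 10, 'seconds': 1}
    # The SEC asks automated clients to identify themselves in the User-Agent header.
    # Placeholder value: replace it with your own name and contact email.
    HEADERS = {'User-Agent': 'Sample Company admin@example.com'}

    # Stay at half of the configured call limit to leave some headroom
    @staticmethod
    @sleep_and_retry
    @limits(calls=SEC_CALL_LIMIT['calls'] // 2, period=SEC_CALL_LIMIT['seconds'])
    def _call_sec(url):
        return requests.get(url, headers=SecAPI.HEADERS)

    def get(self, url):
        # Return the response body as text
        return self._call_sec(url).text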

def print_ten_k_data(ten_k_data, fields, field_length_limit=50):
    indentation = '  '

    print('[')
    for ten_k in ten_k_data:
        print_statement = '{}{{'.format(indentation)
        for field in fields:
            value = str(ten_k[field])

            if isinstance(value, str):
                value_str = '\'{}\''.format(value.replace('\n', '\\n'))
            else:
                value_str = str(value)

            if len(value_str) > field_length_limit:
                value_str = value_str[:field_length_limit] + '...'

            print_statement += '\n{}{}: {}'.format(indentation * 2, field, value_str)

        print_statement += '},'
        print(print_statement)
    print(']')


In [None]:
from bs4 import BeautifulSoup

sec_api = SecAPI()

def get_sec_data(cik, doc_type, start=0, count=60):
    newest_pricing_date = pd.to_datetime('2018-08-01')
    rss_url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany' \
        '&CIK={}&type={}&start={}&count={}&owner=exclude&output=atom' \
        .format(cik, doc_type, start, count)
    sec_data = sec_api.get(rss_url)
    return sec_data

Opening the EDGAR REST URL that we formed above in a browser leads to a web page with tabular data on the company’s filings: the type of filing document, document description, filing date, and file number. Figure 1 shows EDGAR’s search results for that URL.

![image.png](attachment:image.png)<br>
<p style="text-align:center;"><b>Figure 1: EDGAR search results for Nike’s 10-K documents prior to 2020-01-01.</b>
</p>


After fetching the response (the code requests it as an Atom feed by adding `output=atom` to the URL), we can parse it with the *BeautifulSoup* library and extract the links that let us download the 10-K filing reports. These links point to the full text of each filing, which we store in a dictionary keyed by the corresponding stock’s ticker.
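To make the parsing step concrete without any network access, here is a minimal sketch that extracts the link, type, and date from a hand-written toy feed and keeps only filings on or before a cutoff date. It uses the standard library's `xml.etree.ElementTree` in place of BeautifulSoup so that it runs without extra packages. The element names mirror those read by the notebook's code (`filing-href`, `filing-type`, `filing-date`); the exact layout of the toy feed and the helper name `parse_entries` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written toy feed for illustration only; not real EDGAR output.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="text/xml">
      <filing-date>2019-02-01</filing-date>
      <filing-href>https://example.com/filing-a-index.htm</filing-href>
      <filing-type>10-K</filing-type>
    </content>
  </entry>
  <entry>
    <content type="text/xml">
      <filing-date>2021-02-03</filing-date>
      <filing-href>https://example.com/filing-b-index.htm</filing-href>
      <filing-type>10-K</filing-type>
    </content>
  </entry>
</feed>"""


def parse_entries(feed_xml, newest_date):
    """Return (href, type, date) tuples for filings dated on or before newest_date."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        content = entry.find("atom:content", ATOM_NS)
        href = content.find("atom:filing-href", ATOM_NS).text.strip()
        filing_type = content.find("atom:filing-type", ATOM_NS).text.strip()
        filing_date = content.find("atom:filing-date", ATOM_NS).text.strip()
        # Keep only filings on or before the cutoff date
        if date.fromisoformat(filing_date) <= newest_date:
            entries.append((href, filing_type, filing_date))
    return entries


entries = parse_entries(SAMPLE_FEED, date(2020, 1, 1))
print(entries)
```

Only the 2019 entry passes the filter, because the second toy filing is dated after the cutoff.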

In [None]:
def get_sec_data(cik, doc_type, start=0, count=60):
    newest_pricing_data = pd.to_datetime('2020-01-01')
    rss_url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany' \
        '&CIK={}&type={}&start={}&count={}&owner=exclude&output=atom' \
        .format(cik, doc_type, start, count)
    sec_data = sec_api.get(rss_url)
    feed = BeautifulSoup(sec_data.encode('ascii'), 'xml').feed
    entries = [
        (
            entry.content.find('filing-href').getText(),
            entry.content.find('filing-type').getText(),
            entry.content.find('filing-date').getText())
        for entry in feed.find_all('entry', recursive=False)
        if pd.to_datetime(entry.content.find('filing-date').getText()) <= newest_pricing_data]

    return entries

In [None]:
import pandas as pd
from bs4 import BeautifulSoup

# Assuming sec_api and SecAPI are defined and working correctly

def get_sec_data(cik, doc_type, start=0, count=60):
    newest_pricing_data = pd.to_datetime('2020-01-01')
    rss_url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany' \
        '&CIK={}&type={}&start={}&count={}&owner=exclude&output=atom' \
        .format(cik, doc_type, start, count)

    # sec_api.get() returns the response body as a string,
    # so it can be passed straight to BeautifulSoup
    sec_data = sec_api.get(rss_url)

    feed = BeautifulSoup(sec_data, 'xml').feed

    entries = [
        (
            entry.content.find('filing-href').getText(),
            entry.content.find('filing-type').getText(),
            entry.content.find('filing-date').getText())
        for entry in feed.find_all('entry', recursive=False)
        if pd.to_datetime(entry.content.find('filing-date').getText()) <= newest_pricing_data]

    return entries

Let's pull the list using the `get_sec_data` function, then display some of the results. For displaying some of the data, we'll use Amazon as an example.

In [None]:
def get_sec_data(cik, doc_type, start=0, count=60):
    newest_pricing_data = pd.to_datetime('2020-01-01')
    rss_url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany' \
        '&CIK={}&type={}&start={}&count={}&owner=exclude&output=atom' \
        .format(cik, doc_type, start, count)
    sec_data = sec_api.get(rss_url)
    # The issue is likely caused by forcing the encoding to 'ascii'.
    # BeautifulSoup can usually auto-detect the encoding,
    # or you can try 'utf-8' which is more versatile.
    feed = BeautifulSoup(sec_data, 'xml').feed
    entries = [
        (
            entry.content.find('filing-href').getText(),
            entry.content.find('filing-type').getText(),
            entry.content.find('filing-date').getText())
        for entry in feed.find_all('entry', recursive=False)
        if pd.to_datetime(entry.content.find('filing-date').getText()) <= newest_pricing_data]

    return entries

In [None]:
def get_sec_data(cik, doc_type, start=0, count=10):
    try:
        # Fetch SEC filings
        response = requests.get(f"https://data.sec.gov/{cik}/{doc_type}")
        response.raise_for_status()
        feed = BeautifulSoup(response.content, 'xml').feed

        if feed is None:
            print(f"No data found for CIK: {cik}")
            return []

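        # Return (link, type, date) tuples, the shape the download loop unpacks
        return [
            (
                entry.content.find('filing-href').getText(),
                entry.content.find('filing-type').getText(),
                entry.content.find('filing-date').getText()
            )
            for entry in feed.find_all('entry', recursive=False)
        ]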
    except Exception as e:
        print(f"Error fetching SEC data for CIK {cik}: {e}")
        return []


In [None]:
import pprint

example_ticker = 'AMZN'
sec_data = {}

for ticker, cik in cik_lookup.items():
    sec_data[ticker] = get_sec_data(cik, '10-K')

pprint.pprint(sec_data[example_ticker][:5])

Error fetching SEC data for CIK 0001018724: 403 Client Error: Forbidden for url: https://data.sec.gov/0001018724/10-K
Error fetching SEC data for CIK 0000200406: 403 Client Error: Forbidden for url: https://data.sec.gov/0000200406/10-K
Error fetching SEC data for CIK 0000063908: 403 Client Error: Forbidden for url: https://data.sec.gov/0000063908/10-K
Error fetching SEC data for CIK 0000077476: 403 Client Error: Forbidden for url: https://data.sec.gov/0000077476/10-K
Error fetching SEC data for CIK 0000104169: 403 Client Error: Forbidden for url: https://data.sec.gov/0000104169/10-K
[]
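The 403 responses above mean the requests were rejected before any feed was returned. Two likely contributors are visible in the cell that defines this version of `get_sec_data`: it calls `requests.get` on `https://data.sec.gov/{cik}/{doc_type}`, which is not the `browse-edgar` URL described earlier, and it sends no identifying `User-Agent` header, which the SEC asks automated clients to declare. The following is a minimal sketch of a request that addresses both points. It uses the standard library's `urllib` rather than `requests`, the `USER_AGENT` value is a placeholder, and `build_edgar_request` and `fetch_feed` are hypothetical helper names.

```python
import urllib.request

# Placeholder identity: replace with your own name and contact address.
USER_AGENT = "Sample Company admin@example.com"


def build_edgar_request(cik, doc_type, count=10):
    """Build (but do not send) a request for the browse-edgar Atom feed."""
    url = (
        "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany"
        f"&CIK={cik}&type={doc_type}&count={count}&owner=exclude&output=atom"
    )
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_feed(cik, doc_type, count=10, timeout=30):
    """Send the request and return the response body as text."""
    request = build_edgar_request(cik, doc_type, count)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Inspect the request without touching the network
request = build_edgar_request("0001018724", "10-K")
print(request.full_url)
# urllib stores header names capitalized as "User-agent"
print(request.get_header("User-agent"))
```

Calling `fetch_feed('0001018724', '10-K')` would send the request; it is left uncalled here so the cell runs offline.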


In [None]:
from tqdm import tqdm

raw_fillings_by_ticker = {}

for ticker, data in sec_data.items():
    raw_fillings_by_ticker[ticker] = {}
    for index_url, file_type, file_date in tqdm(data, desc=f'Downloading {ticker} Filings', unit='filing'):
        if file_type == '10-K':
            file_url = index_url.replace('-index.htm', '.txt').replace('.txtl', '.txt')

            try:
                response = sec_api.get(file_url)
                raw_fillings_by_ticker[ticker][file_date] = response
            except Exception as e:
                print(f"Error fetching {file_url} for {ticker}: {e}")

# Example document output
if raw_fillings_by_ticker[example_ticker]:
    print('Example Document:\n\n{}...'.format(
        next(iter(raw_fillings_by_ticker[example_ticker].values()))[:1000]
    ))
else:
    print(f"No data available for {example_ticker}")


Downloading AMZN Filings: 0filing [00:00, ?filing/s]
Downloading JNJ Filings: 0filing [00:00, ?filing/s]
Downloading MCD Filings: 0filing [00:00, ?filing/s]
Downloading PEP Filings: 0filing [00:00, ?filing/s]
Downloading WMT Filings: 0filing [00:00, ?filing/s]

No data available for AMZN





<h1 style="background-color: #ffd5cd; text-align: center"> Download 10-K Statements</h1>

<h1 style="background-color: #ffd5cd; text-align: center"> Get Documents</h1>

With these filings downloaded, we want to break them into their associated documents. Each document is delimited in the filing by a `<DOCUMENT>` tag at its start and a `</DOCUMENT>` tag at its end. The documents do not overlap, so each `</DOCUMENT>` tag should come after its `<DOCUMENT>` tag with no other `<DOCUMENT>` tag in between.

In [None]:
import re


def get_documents(text):
    """
    Extract the documents from the text

    Parameters
    ----------
    text : str
        The text with the document strings inside

    Returns
    -------
    extracted_docs : list of str
        The document strings found in `text`
    """

    final_docs = []
    start_regex = re.compile(r'<DOCUMENT>')
    end_regex = re.compile(r'</DOCUMENT>')

    start_idx = [x.end() for x in re.finditer(start_regex, text)]
    end_idx = [x.start() for x in re.finditer(end_regex, text)]

    for start_i, end_i in zip(start_idx, end_idx):
        final_docs.append(text[start_i:end_i])


    return final_docs
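As a quick offline check of the tag-splitting idea, here is a minimal sketch on a hand-written toy filing. It uses a single non-greedy regex instead of the paired start/end index lists in `get_documents`; because the documents do not overlap, both approaches should give the same pieces. The name `DOCUMENT_PATTERN` and the toy text are assumptions for illustration.

```python
import re

# Non-greedy match between the tags; DOTALL lets "." span newlines.
DOCUMENT_PATTERN = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)

# Hand-written toy filing, not real EDGAR text.
toy_filing = (
    "<SEC-HEADER>header text</SEC-HEADER>\n"
    "<DOCUMENT>\n<TYPE>10-K\nbody of the annual report\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>EX-21\nlist of subsidiaries\n</DOCUMENT>\n"
)

documents = DOCUMENT_PATTERN.findall(toy_filing)
print(len(documents))
for doc in documents:
    print(repr(doc))
```

The header text outside the tags is dropped, and each extracted string starts right after `<DOCUMENT>` and ends right before `</DOCUMENT>`.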

In [None]:
filling_documents_by_ticker = {}

for ticker, raw_fillings in raw_fillings_by_ticker.items():
    filling_documents_by_ticker[ticker] = {}
    for file_date, filling in tqdm(raw_fillings.items(), desc='Getting Documents from {} Fillings'.format(ticker), unit='filling'):
        filling_documents_by_ticker[ticker][file_date] = get_documents(filling)


print('\n\n'.join([
    'Document {} Filed on {}:\n{}...'.format(doc_i, file_date, doc[:200])
    for file_date, docs in filling_documents_by_ticker[example_ticker].items()
    for doc_i, doc in enumerate(docs)][:3]))

Getting Documents from AMZN Fillings: 0filling [00:00, ?filling/s]
Getting Documents from JNJ Fillings: 0filling [00:00, ?filling/s]
Getting Documents from MCD Fillings: 0filling [00:00, ?filling/s]
Getting Documents from PEP Fillings: 0filling [00:00, ?filling/s]
Getting Documents from WMT Fillings: 0filling [00:00, ?filling/s]







<h1 style="background-color: #ffd5cd; text-align: center">Get Document Types</h1>

Now that we have all the documents, we want to find the 10-K form within each 10-K filing. The `get_document_type` function returns the type of a given document. The document type is located on a line with the `<TYPE>` tag. For example, a form of type "TEST" would have the line `<TYPE>TEST`.

In [None]:
def get_document_type(doc):
    """
    Return the document type lowercased

    Parameters
    ----------
    doc : str
        The document string

    Returns
    -------
    doc_type : str
        The document type lowercased
    """

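    # Regex explanation: (?<=<TYPE>) is a positive lookbehind, so the match starts
    # right after the <TYPE> tag and the tag itself is not part of the result.
    # For example, (?<=a)b matches the b (and only the b) in cab, but does not match bed or debt.
    # More reference: https://www.regular-expressions.info/lookaround.html

    type_regex = re.compile(r'(?<=<TYPE>)[^\n]+')  # everything after <TYPE> up to the end of the line
    match = type_regex.search(doc)
    # Return an empty string when the document has no <TYPE> line
    return match.group(0).strip().lower() if match else ''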
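The lookbehind can be tried on a toy document to see that the `<TYPE>` tag itself is excluded from the match. This is a minimal sketch; the toy text and the name `TYPE_PATTERN` are assumptions for illustration.

```python
import re

# Match everything after "<TYPE>" up to the end of that line
TYPE_PATTERN = re.compile(r"(?<=<TYPE>)[^\n]+")

# Hand-written toy document, not real EDGAR text.
toy_document = "\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>report.htm\n"

match = TYPE_PATTERN.search(toy_document)
doc_type = match.group(0).strip().lower() if match else ""
print(doc_type)  # 10-k

# A document without a <TYPE> line produces no match at all
print(TYPE_PATTERN.search("no type tag here"))  # None
```

Checking for a missing match before calling `.group(0)` avoids an `AttributeError` on documents that lack a `<TYPE>` line.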

In [None]:
ten_ks_by_ticker = {}

for ticker, filling_documents in filling_documents_by_ticker.items():
    ten_ks_by_ticker[ticker] = []
    for file_date, documents in filling_documents.items():
        for document in documents:
            if get_document_type(document) == '10-k':
                ten_ks_by_ticker[ticker].append({
                    'cik': cik_lookup[ticker],
                    'file': document,
                    'file_date': file_date})

print_ten_k_data(ten_ks_by_ticker[example_ticker][:5], ['cik', 'file', 'file_date'])

[
]


<h1 style="background-color: #ffd5cd; text-align: center">Preprocessing 10-K documents</h1>

<h2 style="background-color: #ffd5cd; text-align: center">Parsing via BeautifulSoup</h2>

In [None]:
def remove_html_tags(text):
    text = BeautifulSoup(text, 'html.parser').get_text()

    return text


def clean_text(text):
    text = text.lower()
    text = remove_html_tags(text)

    return text

In [None]:
for ticker, ten_ks in ten_ks_by_ticker.items():
    for ten_k in tqdm(ten_ks, desc='Cleaning {} 10-Ks'.format(ticker), unit='10-K'):
        ten_k['file_clean'] = clean_text(ten_k['file'])


print_ten_k_data(ten_ks_by_ticker[example_ticker][:5], ['file_clean'])

Cleaning AMZN 10-Ks: 010-K [00:00, ?10-K/s]
Cleaning JNJ 10-Ks: 010-K [00:00, ?10-K/s]
Cleaning MCD 10-Ks: 010-K [00:00, ?10-K/s]
Cleaning PEP 10-Ks: 010-K [00:00, ?10-K/s]
Cleaning WMT 10-Ks: 010-K [00:00, ?10-K/s]

[
]





<h2 style="background-color: #ffd5cd; text-align: center">Lemmatization</h2>

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet


def lemmatize_words(words):
    """
    Lemmatize words

    Parameters
    ----------
    words : list of str
        List of words

    Returns
    -------
    lemmatized_words : list of str
        List of lemmatized words
    """

    wnl = WordNetLemmatizer()
    lemmatized_words = [wnl.lemmatize(word, 'v') for word in words]

    return lemmatized_words

In [None]:
word_pattern = re.compile(r'\w+')  # raw string avoids an invalid-escape warning

for ticker, ten_ks in ten_ks_by_ticker.items():
    for ten_k in tqdm(ten_ks, desc='Lemmatize {} 10-Ks'.format(ticker), unit='10-K'):
        ten_k['file_lemma'] = lemmatize_words(word_pattern.findall(ten_k['file_clean']))

Lemmatize AMZN 10-Ks: 010-K [00:00, ?10-K/s]
Lemmatize JNJ 10-Ks: 010-K [00:00, ?10-K/s]
Lemmatize MCD 10-Ks: 010-K [00:00, ?10-K/s]
Lemmatize PEP 10-Ks: 010-K [00:00, ?10-K/s]
Lemmatize WMT 10-Ks: 010-K [00:00, ?10-K/s]


In [None]:
print_ten_k_data(ten_ks_by_ticker[example_ticker][:5], ['file_lemma'])

[
]


<h2 style="background-color: #ffd5cd; text-align: center">Stop-words Removal</h2>

In [None]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
pip install nltk




In [None]:
nltk.download('stopwords', download_dir='/content/nltk_data')  # Specify a local directory


[nltk_data] Downloading package stopwords to /content/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm

# Download necessary NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# Define lemmatization function
def lemmatize_words(words):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in words]

# Load English stopwords and lemmatize them
lemma_english_stopwords = lemmatize_words(stopwords.words('english'))

# Example input: Replace with your actual 10-K data structure
ten_ks_by_ticker = {
    "AMZN": [
        {"file_lemma": ["Amazon", "is", "a", "leading", "ecommerce", "platform"]},
        {"file_lemma": ["The", "company", "focuses", "on", "customer", "satisfaction"]}
    ]
}

# Remove stopwords from each 10-K's lemmatized words
for ticker, ten_ks in ten_ks_by_ticker.items():
    for ten_k in tqdm(ten_ks, desc=f'Remove Stop Words for {ticker} 10-Ks', unit='10-K'):
        ten_k['file_lemma'] = [word for word in ten_k['file_lemma'] if word not in lemma_english_stopwords]

# Print processed data
example_ticker = "AMZN"
print(ten_ks_by_ticker[example_ticker])


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
Remove Stop Words for AMZN 10-Ks: 100%|██████████| 2/2 [00:00<00:00, 884.7810-K/s]

[{'file_lemma': ['Amazon', 'leading', 'ecommerce', 'platform']}, {'file_lemma': ['The', 'company', 'focuses', 'customer', 'satisfaction']}]
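In the output above, `'The'` survives stop-word removal while `'is'`, `'a'`, and `'on'` are dropped. The membership test is case-sensitive and NLTK's English stop-word list is lowercase, so capitalized tokens slip through. In the main pipeline this should not matter, because `clean_text` lowercases the text before tokenization; it only shows up with the mixed-case toy input. The following is a minimal sketch of the difference, using a tiny hand-picked stop-word set (an assumption standing in for the NLTK list).

```python
# Tiny hand-picked stop-word set for illustration; NLTK's list is much longer.
toy_stopwords = {"the", "is", "a", "on"}

tokens = ["The", "company", "focuses", "on", "customer", "satisfaction"]

# Case-sensitive comparison, as in the cell above
case_sensitive = [w for w in tokens if w not in toy_stopwords]

# Lowercasing each token before the lookup catches capitalized stop words too
case_insensitive = [w for w in tokens if w.lower() not in toy_stopwords]

print(case_sensitive)
print(case_insensitive)
```

Using a `set` for the stop words also makes each lookup constant-time, which helps on documents as long as a 10-K.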





In [None]:
from nltk.corpus import stopwords


lemma_english_stopwords = lemmatize_words(stopwords.words('english'))

for ticker, ten_ks in ten_ks_by_ticker.items():
    for ten_k in tqdm(ten_ks, desc='Remove Stop Words for {} 10-Ks'.format(ticker), unit='10-K'):
        ten_k['file_lemma'] = [word for word in ten_k['file_lemma'] if word not in lemma_english_stopwords]


print_ten_k_data(ten_ks_by_ticker[example_ticker][:5], ['file_lemma'])

Remove Stop Words for AMZN 10-Ks: 100%|██████████| 2/2 [00:00<00:00, 2644.5810-K/s]

[
  {
    file_lemma: '['Amazon', 'leading', 'ecommerce', 'platform']'},
  {
    file_lemma: '['The', 'company', 'focuses', 'customer', 'satisf...},
]





<h1 style="background-color: #ffd5cd; text-align: center">References</h1>

1. [GitHub - AI for Trading](https://github.com/purvasingh96/AI-for-Trading/tree/master/Term%202/Projects/Project%20-%205%20-%20NLP%20on%20Financial%20Statements)
2. [SEC website](https://www.sec.gov/edgar/searchedgar/companysearch.html)
3. [Udacity's Nanodegree materials](https://www.udacity.com/course/ai-for-trading--nd880)