### SEC Scraper Workbook

### 1. Statement

This workbook illustrates how to extract, structure, and export company data from the SEC's EDGAR filing website. The code is intended only for internal use in ABB's consulting practicum project and for research purposes. For more details on the data source, see https://www.sec.gov/edgar/searchedgar/companysearch.

The SEC scraping tool consists of three modules: `cikscraping.py`, which takes company tickers and returns the corresponding CIK keys; `webscraping.py`, which returns the links to those companies' 10-K report files; and `docscraping.py`, which parses the content of the 10-K reports and returns three variables (regulatory compliance, product portfolio, and analysis of acquisitions).

Because 10-K reports differ in writing style, we use several techniques to make the text extraction more robust. For details of the text preprocessing and sample outputs of the program, see `DocScraper_test1.ipynb`.

### 2. Code

In [15]:
import pandas as pd

#### 2.1 *cikscraping.py*

##### 2.1.1 *CIKScraper(ticker, headers)*

EDGAR identifies each company by a unique code called the CIK (Central Index Key), which our web scraping tool requires. `cikscraping.py` therefore provides a `CIKScraper()` object that, given a company ticker, automatically returns that company's CIK.

In [1]:
from cikscraping import CIKScraper
headers = {"User-Agent": "bxie43@wisc.edu"}
company = CIKScraper("ADM", headers)

##### 2.1.2 *parsing_tickers()*

The `parsing_tickers()` method returns a ticker table by looking up the company's ticker in https://www.sec.gov/files/company_tickers.json and fetching the corresponding CIK. In the example below, passing the ticker "ADM" shows that `7084` is the CIK of Archer Daniels Midland.

In [3]:
company.parsing_tickers()

Unnamed: 0,cik_str,ticker,title
317,7084,ADM,Archer-Daniels-Midland Co
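
As a rough illustration of the lookup described above, the sketch below assumes the JSON has already been downloaded and parsed into a dictionary keyed by row number, with `cik_str`, `ticker`, and `title` fields matching the columns of the table above. The helper `lookup_cik` and the second sample row are hypothetical, and this is not the actual `CIKScraper` implementation.

```python
import pandas as pd


def lookup_cik(tickers_json: dict, ticker: str) -> pd.DataFrame:
    """Return the rows of the ticker table that match the given ticker."""
    # Each top-level key is a row label; each value holds cik_str, ticker, title.
    table = pd.DataFrame.from_dict(tickers_json, orient="index")
    return table[table["ticker"] == ticker.upper()]


# Small stand-in for the parsed JSON. The ADM row comes from the output above;
# the second row is a made-up placeholder.
sample_json = {
    "0": {"cik_str": 7084, "ticker": "ADM", "title": "Archer-Daniels-Midland Co"},
    "1": {"cik_str": 1, "ticker": "XYZ", "title": "Hypothetical Co"},
}

adm_row = lookup_cik(sample_json, "adm")
adm_cik = int(adm_row["cik_str"].iloc[0])
print(adm_row)
```

With the real file, the same function could be applied to the parsed response of a request that sends the `headers` dictionary defined earlier, since SEC asks automated clients to declare a User-Agent.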


#### 2.2 *webscraping.py*

After getting the CIK value, we can fetch the URLs of the 10-K documents for the companies on our list. For details of how the URLs are fetched, see `SSRN-id3230156.pdf` in the same file location.

##### 2.2.1 *SECScraper(cik, year, FILE, headers)*

We create a `SECScraper()` object to store the URLs of the documents. Given the company's CIK from `CIKScraper()`, the year, and the document type (for example, "10-K" or "10-Q"), the object adds the URL of that document as an attribute.

In [8]:
from webscraping import SECScraper
year = '2023'
FILE = '10-K' # Specify that we want the 10-K reports.
scraper = SECScraper(company.cik, year, FILE, headers)

##### 2.2.2 *scrape_sec_data()*

The `scrape_sec_data()` method visits a JSON file published by EDGAR that contains the master index of companies' documents. The call below takes the first element of its result, which is the URL of the matching filing.

In [9]:
url = scraper.scrape_sec_data()[0]
url

'https://www.sec.gov/Archives/edgar/data/7084/0000007084-23-000010.txt'
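
To make the structure of the returned URL explicit, here is a minimal sketch that splits it into its parts. It assumes every URL follows the same layout as the one above: the company's CIK as a folder, followed by the accession number, which EDGAR writes as a 10-digit filer ID, a 2-digit year, and a 6-digit sequence number. The helper `split_filing_url` is hypothetical and not part of `webscraping.py`.

```python
import re

# Layout assumed from the example URL: .../edgar/data/<cik>/<accession>.txt
FILING_URL = re.compile(
    r"https://www\.sec\.gov/Archives/edgar/data/"
    r"(?P<cik>\d+)/(?P<accession>\d{10}-\d{2}-\d{6})\.txt"
)


def split_filing_url(url: str) -> dict:
    """Split an EDGAR full-text filing URL into CIK and accession parts."""
    match = FILING_URL.fullmatch(url)
    if match is None:
        raise ValueError(f"not a recognised EDGAR filing URL: {url}")
    filer_id, year_suffix, sequence = match["accession"].split("-")
    return {
        "cik": match["cik"],
        "accession": match["accession"],
        "filer_id": filer_id,        # 10-digit ID of the submitting entity
        "year_suffix": year_suffix,  # last two digits of the filing year
        "sequence": sequence,        # running count for that filer and year
    }


parts = split_filing_url(
    "https://www.sec.gov/Archives/edgar/data/7084/0000007084-23-000010.txt"
)
print(parts)
```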

#### 2.3 *docscraping.py*

After obtaining the URL of the document, e.g. the 10-K report for 2023, we can parse the text and extract three variables: `regulatory_compliance`, `product portfolio`, and `analysis of acquisitions`. The parsing process is implemented in `docscraping.py`.

##### 2.3.1 *DocScraper(url, FILE, headers, api_key, year, ticker)*

The `DocScraper()` object takes six arguments to parse the corresponding document. The `url`, `FILE`, `headers`, `year`, and `ticker` arguments have been obtained in the previous steps. The new argument, `api_key`, is the key to the SEC API at https://sec-api.io/. This API automatically splits a document into its item sections, which helps significantly because the layout of the variables in 10-K reports varies from company to company, making manual selection difficult.

In [12]:
from docscraping import DocScraper
api_key = '4443770e404975687133744ecfc296d86e74498fc8e5536678eae20c35423505'
parser = DocScraper(url, FILE, headers, api_key, year, 'ADM')
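
For orientation, the sketch below shows how a request for a single item section might be built. The endpoint `https://api.sec-api.io/extractor` and the query parameters `url`, `item`, `type`, and `token` are assumptions based on the public sec-api.io Extractor API and should be checked against its documentation. The helpers `build_extractor_url` and `fetch_item` are hypothetical and do not reflect how `DocScraper` is actually written.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the sec-api.io Extractor API.
EXTRACTOR_ENDPOINT = "https://api.sec-api.io/extractor"


def build_extractor_url(filing_url: str, item: str, api_key: str,
                        return_type: str = "text") -> str:
    """Build the request URL for one item section (e.g. '1', '1A', '8')."""
    query = urlencode({
        "url": filing_url,    # the filing located in the previous step
        "item": item,         # which item section to extract
        "type": return_type,  # plain text rather than HTML
        "token": api_key,
    })
    return f"{EXTRACTOR_ENDPOINT}?{query}"


def fetch_item(filing_url: str, item: str, api_key: str) -> str:
    """Download one item section as text. Performs a network request."""
    with urlopen(build_extractor_url(filing_url, item, api_key)) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; no request is sent.
request_url = build_extractor_url(
    "https://www.sec.gov/Archives/edgar/data/7084/0000007084-23-000010.txt",
    "1A",
    "YOUR_API_KEY",
)
print(request_url)
```

Reading the key from an environment variable rather than writing it into the notebook would keep it out of shared copies of the workbook.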

##### 2.3.2 *regulatory_compliance(item1a, ticker)*

The `regulatory_compliance()` function parses the `item1a` section of a 10-K report into a dictionary of the company's disclosed risks, then selects the keys that contain 'regulatory' or 'legal', which indicate that the risk relates to regulation. Companies mainly structure this section in one of three ways.

1. Pre-type: `r'\n\n(Risks Relat\w+? to [^\n]+)\s+'`

In the first type, each section title starts with "Risks Relating to" or "Risks Related to", so this regex matches every risk heading of that form in the report.

2. Post-type: `r'[A-Z][A-Za-z,\s]* Risks'`

In the second type, each title ends with "Risks", for example "Regulatory Risks" or "Business Risks".

3. Irregular type: `r'\n\n([^\n]+regulatory[^\n]+)\s+'`

The last type has no fixed format, so users have to write a regex for it themselves. In our case, the 10-K reports of tickers 'BGS' and 'TSN' use a heading like " About Regulatory Information", so we use the pattern above to fuzzy-match their regulatory information.
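
The three patterns can be tried on small synthetic strings, as in the sketch below. The sample text and the helpers `split_by_headings` and `select_regulatory` are hypothetical and only approximate what `regulatory_compliance()` does. Note that the irregular pattern as written is case-sensitive, so matching a heading such as "About Regulatory Information" requires a flag like `re.IGNORECASE`; applying that flag here is an assumption.

```python
import re

# The three heading patterns quoted above.
PRE_TYPE = r'\n\n(Risks Relat\w+? to [^\n]+)\s+'
POST_TYPE = r'[A-Z][A-Za-z,\s]* Risks'
IRREGULAR = r'\n\n([^\n]+regulatory[^\n]+)\s+'


def split_by_headings(text, pattern):
    """Split text into {heading: body} using a pattern with one capture group."""
    # With one group, re.split returns [preamble, head1, body1, head2, body2, ...].
    parts = re.split(pattern, text)
    return {parts[i].strip(): parts[i + 1].strip()
            for i in range(1, len(parts) - 1, 2)}


def select_regulatory(sections, keywords=("regulatory", "legal")):
    """Keep only the sections whose heading mentions one of the keywords."""
    return {title: body for title, body in sections.items()
            if any(word in title.lower() for word in keywords)}


# Made-up Item 1A excerpt in the pre-type style.
sample = (
    "Item 1A. Risk Factors"
    "\n\nRisks Related to Our Business\n\nDemand for our products may decline."
    "\n\nRisks Related to Legal and Regulatory Matters"
    "\n\nWe are subject to many regulations."
)
sections = split_by_headings(sample, PRE_TYPE)
regulatory = select_regulatory(sections)

# Post-type headings quoted in the text.
post_ok = re.fullmatch(POST_TYPE, "Regulatory Risks") is not None

# Irregular type: the capitalised heading only matches when case is ignored.
irregular_text = "Overview\n\nAbout Regulatory Information\n\nWe follow food safety rules."
strict = re.findall(IRREGULAR, irregular_text)
relaxed = re.findall(IRREGULAR, irregular_text, flags=re.IGNORECASE)

print(regulatory)
print(post_ok, strict, relaxed)
```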


##### 2.3.3 *product_portfolio(item1)*

The `product_portfolio()` function returns a dictionary whose keys are the paragraph titles and whose values are the corresponding text in the `Item 1` section of the 10-K report. Because writing styles differ across companies, we cannot match product portfolio information exactly. Instead, we store all of the section's content in the dictionary so that users can filter the key-value pairs they are interested in with a Python script.

##### 2.3.4 *acquisitions(item8)*

The `acquisitions()` function extracts a company's analysis of its recent acquisitions, which is usually found in the `Item 8` section of the 10-K report. Inconsistent writing styles again prevent exact matching, so we take the same approach as for `product_portfolio()`: we store the paragraph titles and the corresponding text in a dictionary for users to process further.

##### 2.3.5 *parsing_file()*

The `parsing_file()` method splits the entire report into sections, for example Item 1, Item 1A, and Item 7. We then apply regular expressions and text preprocessing functions to the content of these items to standardize it and extract the text that may contain the three variables.

In [13]:
parser.parsing_file()

Example: ADM's 2023 10-K report discloses information on regulatory compliance, product portfolio, and analysis of recent acquisitions. All three variables are stored as attributes of the `parser` object.

In [14]:
print("Regulatory_compliance:", parser.regulatorycompliance)
print("----------------------")
print("Product_portfolio:", parser.productportfolio)
print("----------------------")
print("analysis of acquisitions:", parser.acquisitions)

Regulatory_compliance: {'Regulatory Risks': ' The Company is subject to numerous laws, regulations, and mandates globally which could adversely affect the Company’s operating results and forward strategy. The Company does business globally, connecting crops and markets in over 190 countries, and is required to comply with laws and regulations administered by the United States federal government as well as state, local, and non-U.S. governmental authorities in numerous areas including: accounting and income taxes, anti-corruption, anti-bribery, global trade, trade sanctions, privacy and security, environmental, product safety, and handling and production of regulated substances. The Company frequently faces challenges from U.S. and foreign tax authorities regarding the amount of taxes due including questions regarding the timing, amount of deductions, the allocation of income among various tax jurisdictions, and further risks related to changing tax laws domestically and globally. Any f
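
Because each attribute is a plain dictionary keyed by paragraph title, users can narrow it down with a few lines of Python. The sketch below is one possible filter. The helper `filter_sections` and the sample dictionary are hypothetical stand-ins, not real output from `parser.productportfolio`.

```python
def filter_sections(sections, keywords):
    """Keep entries whose title or text mentions any keyword (case-insensitive)."""
    lowered = [word.lower() for word in keywords]
    return {
        title: text
        for title, text in sections.items()
        if any(word in title.lower() or word in text.lower() for word in lowered)
    }


# Made-up stand-in for a dictionary such as parser.productportfolio.
sample_portfolio = {
    "Segment Overview": "The segment processes oilseeds into vegetable oils.",
    "Employees": "The workforce is spread across several regions.",
    "Products": "Key products include sweeteners and starches.",
}

matches = filter_sections(sample_portfolio, ["product", "oils"])
print(list(matches))
```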

#### 2.4 Combination

To process many companies, we create a separate `DocScraper()` object for each one and build one table per variable by combining the objects' attributes.

In [None]:
FILE = '10-K'
Year = '2023'
api_key = '4443770e404975687133744ecfc296d86e74498fc8e5536678eae20c35423505'


# Read all the company tickers from the excel sheet.
master_idx = pd.read_excel("master_excel_all_variables.xlsx", sheet_name='master_excel_all_variables')
company_list = master_idx['company_ticker']

# Store the attributes into different tables according to the variables.
regulatory_compliance = pd.DataFrame(columns=['ticker', 'year', 'file', 'regulatory_compliance'])
product_portfolio = pd.DataFrame(columns=['ticker', 'year', 'file', 'product_portfolio'])
acquisition_info = pd.DataFrame(columns=['ticker', 'year', 'file', 'acquisition_info'])

for ticker in company_list:
    try:
        # Resolve the ticker to a CIK, then locate the filing URL.
        company = CIKScraper(ticker, headers)
        company.parsing_tickers()
        scraper = SECScraper(company.cik, Year, FILE, headers)
        url = scraper.scrape_sec_data()[0]

        try:
            # Parse the filing and extract the three variables.
            parser = DocScraper(url, FILE, headers, api_key, Year, ticker)
            parser.parsing_file()

            # regulatory_compliance
            data_rc = {
                'ticker': ticker,
                'year': Year,
                'file': FILE,
                'regulatory_compliance': [parser.regulatorycompliance]
            }
            df_rc = pd.DataFrame(data_rc)
            regulatory_compliance = pd.concat([regulatory_compliance, df_rc], ignore_index=True)

            # product_portfolio
            data_pp = {
                'ticker': ticker,
                'year': Year,
                'file': FILE,
                'product_portfolio': [parser.productportfolio]
            }
            df_pp = pd.DataFrame(data_pp)
            product_portfolio = pd.concat([product_portfolio, df_pp], ignore_index=True)

            # acquisitions information
            data_ai = {
                'ticker': ticker,
                'year': Year,
                'file': FILE,
                'acquisition_info': [parser.acquisitions]
            }
            df_ai = pd.DataFrame(data_ai)
            acquisition_info = pd.concat([acquisition_info, df_ai], ignore_index=True)

        except Exception as e:
            # Parsing failed for this filing; report it and continue.
            print("parsing failed for:", ticker)
            print(e)

    except Exception as e:
        # The CIK or the filing URL could not be found for this ticker.
        print("data is not available in:", ticker)
        print(e)
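
Each table built by the loop holds one dictionary per company, which is awkward to export directly. One possible follow-up, sketched below under the assumption that every cell in the variable column is a dictionary of `{title: text}`, is to flatten a table into one row per paragraph. The helper `flatten_variable` and the sample rows are hypothetical and use placeholder text.

```python
import pandas as pd


def flatten_variable(table: pd.DataFrame, column: str) -> pd.DataFrame:
    """Turn one dictionary per company into one row per (company, title)."""
    rows = []
    for record in table.to_dict("records"):
        sections = record[column]
        if not isinstance(sections, dict):
            continue  # skip companies where nothing was extracted
        for title, text in sections.items():
            rows.append({
                "ticker": record["ticker"],
                "year": record["year"],
                "file": record["file"],
                "title": title,
                "text": text,
            })
    return pd.DataFrame(rows, columns=["ticker", "year", "file", "title", "text"])


# Stand-in with the same columns as regulatory_compliance; texts are placeholders.
sample_table = pd.DataFrame({
    "ticker": ["ADM", "XYZ"],
    "year": ["2023", "2023"],
    "file": ["10-K", "10-K"],
    "regulatory_compliance": [
        {"Regulatory Risks": "placeholder text A", "Legal Risks": "placeholder text B"},
        None,
    ],
})

long_table = flatten_variable(sample_table, "regulatory_compliance")
print(long_table)
```

The flattened table can then be written out with a standard pandas exporter such as `long_table.to_csv(...)`.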