# Datasets:

## Financial Statements

The financial dataset was obtained from the SEC EDGAR Financial Statement Data Sets, which include each company's balance sheet, income statement, and statement of cash flows (SEC, January 2009 - June 2024). The data is published quarterly and covers January 2009 to June 2024, the most recent release as of the writing of this proposal. The SEC derives these data sets from eXtensible Business Reporting Language (XBRL) filings and splits them across several separate tables (SEC, 2024). To give the large language model a single set of tables, we will use the following helper tool (HansjoergW, 2024) to merge the dataset into a single data frame. From the merged statements we will then calculate a comprehensive set of financial ratios to provide to the model. This yields a dataset similar to the one used in Kim et al. (2024).

GitHub repo: https://github.com/HansjoergW/sec-fincancial-statement-data-set/tree/main
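As a rough illustration of the ratio step, the sketch below computes three common liquidity and leverage ratios from one period of balance sheet values. The tag names follow the us-gaap taxonomy, but the choice of ratios, the `compute_ratios` helper, and the round placeholder amounts are assumptions made for illustration. They are not the formula set of this proposal or of Kim et al. (2024).

```python
import pandas as pd


def compute_ratios(values: pd.Series) -> pd.Series:
    """Compute a few illustrative ratios from one period of balance sheet values.

    `values` is indexed by us-gaap tag name. This helper is hypothetical and
    only shows the shape of the calculation.
    """
    current_ratio = values["AssetsCurrent"] / values["LiabilitiesCurrent"]
    cash_ratio = (
        values["CashAndCashEquivalentsAtCarryingValue"] / values["LiabilitiesCurrent"]
    )
    debt_to_equity = values["Liabilities"] / values["StockholdersEquity"]
    return pd.Series(
        {
            "current_ratio": current_ratio,
            "cash_ratio": cash_ratio,
            "debt_to_equity": debt_to_equity,
        }
    )


# Placeholder amounts (not real filing data), chosen so the results are easy to check.
example = pd.Series(
    {
        "AssetsCurrent": 200.0,
        "LiabilitiesCurrent": 100.0,
        "CashAndCashEquivalentsAtCarryingValue": 50.0,
        "Liabilities": 300.0,
        "StockholdersEquity": 150.0,
    }
)

ratios = compute_ratios(example)
print(ratios)
```

In practice the input series would come from the merged data frame, one series per company and reporting period.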



# Initial Setup

In [None]:
# to ensure that the logging statements are shown in the Jupyter output, run this cell
import logging
import pandas as pd

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# ensure that all columns are shown and that column content is not cut
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

# Configuration

Before it can be used, the library needs to know where to store the compressed files from the Financial Statement Data Sets and where to store the SQLite database file. Both locations are set in a configuration file.

If you don't provide a config file, one will be created in your home directory the first time you use the API. You can then edit its content or start downloading the data right away.

[DEFAULT]
downloaddirectory = <userhome>/secfsdstools/data/dld
parquetdirectory = <userhome>/secfsdstools/data/parquet
dbdirectory = <userhome>/secfsdstools/data/db
useragentemail = your.email@goeshere.com
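
To check what a configuration like the one above resolves to, the following sketch parses the same keys with Python's standard `configparser` module. The `load_settings` helper and the replacement of the `<userhome>` placeholder with the home directory are assumptions for illustration; how the library itself reads the file is not shown here.

```python
import configparser
from pathlib import Path

# Same keys as in the configuration shown above.
CONFIG_TEXT = """
[DEFAULT]
downloaddirectory = <userhome>/secfsdstools/data/dld
parquetdirectory = <userhome>/secfsdstools/data/parquet
dbdirectory = <userhome>/secfsdstools/data/db
useragentemail = your.email@goeshere.com
"""


def load_settings(text: str, home: Path) -> dict:
    """Parse the [DEFAULT] section and expand the <userhome> placeholder.

    Hypothetical helper: the placeholder handling is an assumption.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    settings = {}
    for key, value in parser.defaults().items():
        if key.endswith("directory"):
            # Turn the placeholder into a concrete path under the home directory.
            value = str(Path(value.replace("<userhome>", str(home))))
        settings[key] = value
    return settings


settings = load_settings(CONFIG_TEXT, Path.home())
for key, value in settings.items():
    print(f"{key} = {value}")
```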

When `update()` runs, the following tasks are executed:

1. All currently available zip files are downloaded from sec.gov (over 50 files that need more than 2 GB of space on your local drive).
2. All zip files are transformed and stored as parquet files. By default, each zip file is deleted afterwards. If you want to keep the zip files, set the parameter 'KeepZipFiles' in the config file to True.
3. An index is created inside a SQLite db file.

This may take a few minutes.

If you don't call update "manually", then the first time you call a function from the library, a download will be triggered.

Moreover, at most once a day, the library checks whether a new zip file is available on sec.gov. If there is, a download starts automatically. If you don't want this auto-update, set 'AutoUpdate' in your config file to False. New quarterly zip files become available at the beginning of every quarter (January, April, July, October). Hence, with auto-update disabled, you have to run update() at the beginning of every quarter to get the data for the reports from the previous quarter.
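
The quarterly schedule can be made concrete with a small sketch that names the newest quarter file one would expect for a given date. The `<year>q<quarter>.zip` pattern matches the file names in the report index (for example `2024q2.zip`). The `latest_expected_quarter_file` helper is hypothetical, and it assumes the previous quarter's file is already published on any day of the new quarter, which may not hold in the first days of the month.

```python
from datetime import date


def latest_expected_quarter_file(today: date) -> str:
    """Return the file name of the most recently completed quarter.

    Hypothetical helper; assumes the file for a quarter is published once
    the following quarter has started.
    """
    current_quarter = (today.month - 1) // 3 + 1
    if current_quarter == 1:
        # In January to March, the newest complete quarter is Q4 of last year.
        return f"{today.year - 1}q4.zip"
    return f"{today.year}q{current_quarter - 1}.zip"


print(latest_expected_quarter_file(date(2024, 7, 5)))   # 2024q2.zip
print(latest_expected_quarter_file(date(2024, 1, 10)))  # 2023q4.zip
```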

Note: the first time downloading data will take a couple of minutes, since over 2 GB of data will be downloaded and converted into parquet format.

Note: If you plan to use Jupyter, make sure that you configure the directories at a location where your Jupyter process has access. The used default directory (your user home directory) will work.

In [None]:
from secfsdstools.update import update

update()

# Index Search

Determine the Central Index Key (CIK) from a company name. Then use the CIK to obtain the list of all of the company's reports, and use that list to find the accession number (ADSH) that identifies a specific report.

In [29]:
from secfsdstools.c_index.searching import IndexSearch

index_search = IndexSearch.get_index_search()
results = index_search.find_company_by_name("APPLE INC")
results

2024-10-01 12:16:37,751 [INFO] configmgt  reading configuration from /Users/joseluistejada/.secfsdstools.cfg


Unnamed: 0,name,cik
0,APPLE INC,320193


In [17]:
from secfsdstools.c_index.companyindexreading import CompanyIndexReader

apple_cik = 320193
apple_index_reader = CompanyIndexReader.get_company_index_reader(cik=apple_cik)

2024-10-01 12:07:54,349 [INFO] configmgt  reading configuration from /Users/joseluistejada/.secfsdstools.cfg


In [30]:
allReportsApple = apple_index_reader.get_all_company_reports_df()
allReportsApple

Unnamed: 0,adsh,cik,name,form,filed,period,fullPath,originFile,originFileType,url
0,0001140361-24-024352,320193,APPLE INC,8-K,20240503,20240430,/Users/joseluistejada/secfsdstools/data/parquet/quarter/2024q2.zip,2024q2.zip,quarter,https://www.sec.gov/Archives/edgar/data/320193/000114036124024352/0001140361-24-024352-index.htm
1,0000320193-24-000067,320193,APPLE INC,8-K,20240502,20240430,/Users/joseluistejada/secfsdstools/data/parquet/quarter/2024q2.zip,2024q2.zip,quarter,https://www.sec.gov/Archives/edgar/data/320193/000032019324000067/0000320193-24-000067-index.htm
2,0000320193-24-000069,320193,APPLE INC,10-Q,20240503,20240331,/Users/joseluistejada/secfsdstools/data/parquet/quarter/2024q2.zip,2024q2.zip,quarter,https://www.sec.gov/Archives/edgar/data/320193/000032019324000069/0000320193-24-000069-index.htm
3,0001140361-24-010155,320193,APPLE INC,8-K,20240228,20240229,/Users/joseluistejada/secfsdstools/data/parquet/quarter/2024q1.zip,2024q1.zip,quarter,https://www.sec.gov/Archives/edgar/data/320193/000114036124010155/0001140361-24-010155-index.htm
4,0001308179-24-000010,320193,APPLE INC,DEF 14A,20240111,20240229,/Users/joseluistejada/secfsdstools/data/parquet/quarter/2024q1.zip,2024q1.zip,quarter,https://www.sec.gov/Archives/edgar/data/320193/000130817924000010/0001308179-24-000010-index.htm
...,...,...,...,...,...,...,...,...,...,...
99,0001193125-10-088957,320193,APPLE INC,10-Q,20100421,20100331,/Users/joseluistejada/secfsdstools/data/parquet/quarter/2010q2.zip,2010q2.zip,quarter,https://www.sec.gov/Archives/edgar/data/320193/000119312510088957/0001193125-10-088957-index.htm
100,0001193125-10-012085,320193,APPLE INC,10-Q,20100125,20091231,/Users/joseluistejada/secfsdstools/data/parquet/quarter/2010q1.zip,2010q1.zip,quarter,https://www.sec.gov/Archives/edgar/data/320193/000119312510012085/0001193125-10-012085-index.htm
101,0001193125-10-012091,320193,APPLE INC,10-K/A,20100125,20090930,/Users/joseluistejada/secfsdstools/data/parquet/quarter/2010q1.zip,2010q1.zip,quarter,https://www.sec.gov/Archives/edgar/data/320193/000119312510012091/0001193125-10-012091-index.htm
102,0001193125-09-214859,320193,APPLE INC,10-K,20091027,20090930,/Users/joseluistejada/secfsdstools/data/parquet/quarter/2009q4.zip,2009q4.zip,quarter,https://www.sec.gov/Archives/edgar/data/320193/000119312509214859/0001193125-09-214859-index.htm
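
To go from this report list to a single accession number, one option is to filter by form type and take the most recently filed entry. The sketch below rebuilds the first five rows of the output above as a small data frame so it runs on its own. Restricting the selection to 10-K and 10-Q filings is an assumption about which forms are of interest here.

```python
import pandas as pd

# First five rows of the report list above (only the columns needed here).
reports = pd.DataFrame(
    {
        "adsh": [
            "0001140361-24-024352",
            "0000320193-24-000067",
            "0000320193-24-000069",
            "0001140361-24-010155",
            "0001308179-24-000010",
        ],
        "form": ["8-K", "8-K", "10-Q", "8-K", "DEF 14A"],
        "filed": [20240503, 20240502, 20240503, 20240228, 20240111],
        "period": [20240430, 20240430, 20240331, 20240229, 20240229],
    }
)

# Assumption: only annual and quarterly reports are of interest.
statement_forms = ["10-K", "10-Q"]
financial_reports = reports[reports["form"].isin(statement_forms)].sort_values(
    "filed", ascending=False
)

# Accession number of the most recently filed 10-K or 10-Q.
latest_adsh = financial_reports.iloc[0]["adsh"]
print(latest_adsh)  # 0000320193-24-000069
```

With the full data frame, the same filter could be applied directly to `allReportsApple`.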


## Collect single report

In [34]:
adsh = "0000320193-24-000069"

In [36]:
from secfsdstools.e_collector.reportcollecting import SingleReportCollector
from secfsdstools.e_filter.rawfiltering import ReportPeriodAndPreviousPeriodRawFilter
from secfsdstools.e_presenter.presenting import StandardStatementPresenter

# use a collector to grab the data of the 10-Q report and filter for balance sheet information
collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(
    adsh=adsh,
    stmt_filter=["BS"]
)
rawdatabag = collector.collect()  # load the data from the disk


bs_df = (rawdatabag
         # ensure only data from the report period (20240331) and the previous period (20230331) is in the data
         .filter(ReportPeriodAndPreviousPeriodRawFilter())
         # join the content of the pre_txt and num_txt together
         .join()
         # format the data in the same way as it appears in the report
         .present(StandardStatementPresenter()))
print(bs_df)

2024-10-01 12:40:02,402 [INFO] configmgt  reading configuration from /Users/joseluistejada/.secfsdstools.cfg


                    adsh coreg                                              tag       version stmt  report  line     uom  negating  inpth  qrtrs_0/20240331  qrtrs_0/20230331
0   0000320193-24-000069                  CashAndCashEquivalentsAtCarryingValue  us-gaap/2023   BS       4     3     USD         0      0      3.269500e+10               NaN
1   0000320193-24-000069                            MarketableSecuritiesCurrent  us-gaap/2023   BS       4     4     USD         0      0      3.445500e+10               NaN
2   0000320193-24-000069                           AccountsReceivableNetCurrent  us-gaap/2023   BS       4     5     USD         0      0      2.183700e+10               NaN
3   0000320193-24-000069                             NontradeReceivablesCurrent  us-gaap/2023   BS       4     6     USD         0      0      1.931300e+10               NaN
4   0000320193-24-000069                                           InventoryNet  us-gaap/2023   BS       4     7     USD         0
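
Individual line items can be read from the presented data frame by tag. The sketch below rebuilds the four complete rows of the output above (tag, unit, and the 20240331 value column only) and sums cash and current marketable securities. The variable names are illustrative.

```python
import pandas as pd

# The four complete rows from the output above, reduced to the columns needed.
bs_sample = pd.DataFrame(
    {
        "tag": [
            "CashAndCashEquivalentsAtCarryingValue",
            "MarketableSecuritiesCurrent",
            "AccountsReceivableNetCurrent",
            "NontradeReceivablesCurrent",
        ],
        "uom": ["USD", "USD", "USD", "USD"],
        "qrtrs_0/20240331": [3.2695e10, 3.4455e10, 2.1837e10, 1.9313e10],
    }
)

period_col = "qrtrs_0/20240331"

# Index by tag so that each line item can be looked up by name.
values = bs_sample.set_index("tag")[period_col]

cash_and_securities = (
    values["CashAndCashEquivalentsAtCarryingValue"]
    + values["MarketableSecuritiesCurrent"]
)
print(f"{cash_and_securities:,.0f} USD")  # 67,150,000,000 USD
```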

In [38]:
adsh_8k = "0001140361-24-010155"
# use a collector to grab the data of the 8-K report, this time without a statement filter
collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(
    adsh=adsh_8k,
)
rawdatabag = collector.collect()  # load the data from the disk


bs_df = (rawdatabag
         # ensure only data from the report period and the previous period is in the data
         .filter(ReportPeriodAndPreviousPeriodRawFilter())
         # join the content of the pre_txt and num_txt together
         .join()
         # format the data in the same way as it appears in the report
         .present(StandardStatementPresenter()))
print(bs_df)

2024-10-01 12:42:53,961 [INFO] configmgt  reading configuration from /Users/joseluistejada/.secfsdstools.cfg


Empty DataFrame
Columns: [adsh, coreg, tag, version, stmt, report, line, uom, negating, inpth]
Index: []
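
The 8-K yields an empty data frame without any `qrtrs_...` value columns, so downstream code may want to check for that before computing anything. A minimal sketch, assuming that the presence of at least one row and one `qrtrs_` column is a sufficient signal. The `has_statement_values` helper is hypothetical.

```python
import pandas as pd


def has_statement_values(presented: pd.DataFrame) -> bool:
    """Return True if the data frame has rows and at least one value column.

    Hypothetical helper; value columns are recognised by the "qrtrs_" prefix
    seen in the balance sheet output.
    """
    value_columns = [c for c in presented.columns if str(c).startswith("qrtrs_")]
    return bool(value_columns) and not presented.empty


# Same columns as the empty 8-K result above.
empty_result = pd.DataFrame(
    columns=["adsh", "coreg", "tag", "version", "stmt", "report",
             "line", "uom", "negating", "inpth"]
)

# One row in the shape of the 10-Q balance sheet output.
populated_result = pd.DataFrame(
    {
        "adsh": ["0000320193-24-000069"],
        "tag": ["CashAndCashEquivalentsAtCarryingValue"],
        "qrtrs_0/20240331": [3.2695e10],
    }
)

print(has_statement_values(empty_result))      # False
print(has_statement_values(populated_result))  # True
```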
