# Usage Examples and Implementation Notes

This notebook demonstrates a few usage examples of the application I developed for EDGAR debt scraping, and provides some details and context on its implementation.

A high-level system diagram of the application is shown below.  The system was designed so that individual files can be processed in a streaming fashion: applicable 10Qs are located lazily and processed into final results one at a time.

![title](systemDiagram.jpg)
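
The streaming design described above can be illustrated with plain Python generators.  This is only a minimal sketch, not the application's actual code: the function names, the placeholder URLs, and the three-filings-per-quarter stand-in are all assumptions made for illustration.

```python
from itertools import islice

def iter_filing_urls(years):
    """Lazily yield placeholder 10Q URLs, one quarter at a time (hypothetical)."""
    for year in years:
        for quarter in range(1, 5):
            # A real implementation would fetch and parse this quarter's index here.
            for n in range(3):
                yield f"https://example.invalid/{year}/QTR{quarter}/filing{n}.txt"

def process_filing(url):
    """Stand-in for downloading and parsing a single filing."""
    return {"url": url, "length": len(url)}

def run_streaming(years, max_files):
    """Process at most max_files filings, pulling URLs only as they are needed."""
    for url in islice(iter_filing_urls(years), max_files):
        yield process_filing(url)

# Only the first five URLs are ever generated; later quarters are never touched.
results = list(run_streaming([2010, 2011], max_files=5))
```

Because both stages are generators, nothing is located or parsed until a downstream consumer asks for the next result.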

## Section 1: Imports

Make sure that the root directory is on the Python path.  You may have to modify the path below depending on where the notebook is run from.

In [1]:
import sys
sys.path.append("../") #make sure root edgarScraper directory is on pythonpath
from edgarScraper.edgarDebtScraper import EdgarDebtScraper

## Section 2: Usage Examples

The main application is the EdgarDebtScraper object.  It exposes a method called ```runJob``` which supports both single-process and multiprocess execution.  Results can be built in memory and returned as dataFrames, or streamed to disk.  The function docstring supplies more information.



In [13]:
eds = EdgarDebtScraper()
? eds.runJob()

```
Signature: eds.runJob(outputFile=None, years=None, ciks=None, maxFiles=1000, nScraperProcesses=8, nIndexProcesses=8)
Docstring:
main entry method for scraping jobs.  Will write results to
the data directory in form <outputFile>_<year> and disclosures_year if
an outputFile is passed.  Otherwise, it will return a debtline item
dataframe and disclosures dataFrame if no outputFile is supplied.

Note:
    - if a list of specific ciks is supplied, maxFile limit is ignored,
    and the complete set of relevant urls will be eagerly built from
    a distributed search routing.  If no ciks are supplied it will
    lazily iterate through 10Q urls.
    - for large jobs supply an outputFile so that results can be
    periodically written to disk.  Otherwise, pandas dataFrames will
    be built in memory.

Args:
    outputFile: String name of file to write results to.
    years: list of years to restrict 10Q iteration to.
    ciks: list of ciks to restrict 10Q search to
    maxFiles: integer number of maximum files to iterate through
    nScraperProcesses: number of processes to use for processing 10Qs
    nIndexProcesses: number of processes to use for distributed cik
        search.

Returns:
    None if an outputFile is supplied.
    (dataFrame, dataFrame) if no outputFile is supplied 
File:      ~/citadel/edgarScraper/edgarScraper/edgarDebtScraper.py
Type:      method
```
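
The docstring notes that large jobs should supply an ```outputFile``` so results are periodically written to disk rather than accumulated in memory.  A rough sketch of that buffering pattern is shown below.  The ```BufferedCsvWriter``` class, its ```flush_every``` parameter, and the file name are hypothetical and are not taken from the application.

```python
import csv
import os
import tempfile

class BufferedCsvWriter:
    """Collect result rows and append them to a CSV file every flush_every rows."""

    def __init__(self, path, fieldnames, flush_every=100):
        self.path = path
        self.fieldnames = fieldnames
        self.flush_every = flush_every
        self._buffer = []
        self._header_written = False

    def add(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        with open(self.path, "a", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=self.fieldnames)
            if not self._header_written:
                writer.writeheader()
                self._header_written = True
            writer.writerows(self._buffer)
        self._buffer.clear()

# Illustrative usage with made-up rows; memory use is bounded by flush_every.
out_path = os.path.join(tempfile.mkdtemp(), "debt_2010.csv")
writer = BufferedCsvWriter(out_path, ["cik", "DEBTCURRENT"], flush_every=2)
for i in range(5):
    writer.add({"cik": 1000 + i, "DEBTCURRENT": 10.0 * i})
writer.flush()  # write whatever is left in the buffer

with open(out_path, newline="") as fh:
    rows = list(csv.DictReader(fh))
```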

### Example 1 - get 10Qs by year

The following cell shows an example job that processes the first 400 10Q files from the year 2010 using 8 scraper processes.  It should take on the order of 1-2 minutes to run and returns two dataFrames: the first contains extracted line item information, and the second contains free-text debt disclosures.  Logging output can be suppressed by changing the logging level set in ```edgarScraper.config.log.py```.

In [4]:
#sample job - get results for the first 400 10Qs in 2010.  Takes ~ 2mins.  
debtDf, disclosureDf = eds.runJob(
    years = [2010],
    maxFiles = 400,
    nScraperProcesses=8
)

2019-01-15 00:03:42,876 dailyIndLogger INFO Searching for 10Qs in https://www.sec.gov/Archives/edgar/daily-index/2010/QTR1/
2019-01-15 00:03:44,222 dailyIndLogger INFO Generated 25 10-Qs 
2019-01-15 00:03:44,618 dailyIndLogger INFO Generated 50 10-Qs 
2019-01-15 00:03:44,988 dailyIndLogger INFO Generated 75 10-Qs 
2019-01-15 00:03:44,990 dailyIndLogger INFO Generated 100 10-Qs 
2019-01-15 00:03:45,431 dailyIndLogger INFO Generated 125 10-Qs 
2019-01-15 00:03:45,830 dailyIndLogger INFO Generated 150 10-Qs 
2019-01-15 00:03:54,139 dailyIndLogger INFO Generated 175 10-Qs 
2019-01-15 00:03:55,979 dailyIndLogger INFO Generated 200 10-Qs 
2019-01-15 00:03:57,797 dailyIndLogger INFO Generated 225 10-Qs 
2019-01-15 00:04:02,290 edgarScraperLog INFO finished consuming file 100
2019-01-15 00:04:05,708 dailyIndLogger INFO Generated 250 10-Qs 
2019-01-15 00:04:08,929 dailyIndLogger INFO Generated 275 10-Qs 
2019-01-15 00:04:14,067 dailyIndLogger INFO Generated 300 10-Qs 
2019-01-15 00:04:17,058 ed

### Example 2 - Search by CIK

This sample job restricts the search to specific ciks, which changes the behavior of the scraper slightly.  Since the daily index files provided by EDGAR aren't indexed by company name or cik, a distributed search is first conducted to find relevant 10Q filings.  This list is eagerly evaluated and then passed on for further file processing.  Note: this is a consequence of my decision not to store the raw 10Q text files locally.  There are pros and cons to this.  A pro is that I don't have to do any raw data management - a definite advantage considering this is a prototype and there are GBs of data.  A con is that the raw data can't be re-indexed for different use cases.
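
The eager, distributed search can be pictured roughly as follows.  This sketch is an assumption-laden stand-in: it uses a thread pool rather than the application's worker processes, and ```FAKE_INDEX```, ```search_partition```, and ```find_filings``` are hypothetical names operating on made-up index rows.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for per-quarter index files: (cik, form type, url)
FAKE_INDEX = {
    (2010, 1): [(1062822, "10-Q", "a.txt"), (1111111, "10-Q", "b.txt")],
    (2010, 2): [(1062822, "10-K", "c.txt"), (2222222, "10-Q", "d.txt")],
    (2010, 3): [(1062822, "10-Q", "e.txt")],
}

def search_partition(key, ciks):
    """Scan one index partition and return URLs of 10-Q filings for the wanted ciks."""
    return [url for cik, form, url in FAKE_INDEX[key] if form == "10-Q" and cik in ciks]

def find_filings(ciks, n_workers=4):
    """Search every partition concurrently and return the complete URL list."""
    wanted = set(ciks)
    keys = sorted(FAKE_INDEX)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map preserves input order, so results come back in chronological order
        per_partition = pool.map(lambda k: search_partition(k, wanted), keys)
        return [url for urls in per_partition for url in urls]

urls = find_filings([1062822])
```

Unlike the lazy per-year iteration, the full list of matching URLs exists before any filing is processed, which is why a file-count limit has little meaning in this mode.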

In [3]:
# search for a specific cik for years 2010-2018 inclusive.  Takes about 2-3 mins to find relevant filings.
debtDf, disclosureDf = eds.runJob(
    years = list(range(2010,2019)),
    ciks = [1062822],
    nScraperProcesses=4
)

debtDf.head()

2019-01-14 23:54:00,234 dailyIndLogger INFO CIKS specified, calling distributed search routine,ignoring maxFiles limit
2019-01-14 23:54:00,764 dailyIndLogger INFO Searching for 10Qs in https://www.sec.gov/Archives/edgar/daily-index/2010/QTR1/
2019-01-14 23:54:01,029 dailyIndLogger INFO Searching for 10Qs in https://www.sec.gov/Archives/edgar/daily-index/2010/QTR2/
2019-01-14 23:54:01,270 dailyIndLogger INFO Searching for 10Qs in https://www.sec.gov/Archives/edgar/daily-index/2010/QTR3/
2019-01-14 23:54:01,526 dailyIndLogger INFO Searching for 10Qs in https://www.sec.gov/Archives/edgar/daily-index/2010/QTR4/
2019-01-14 23:54:02,059 dailyIndLogger INFO Searching for 10Qs in https://www.sec.gov/Archives/edgar/daily-index/2011/QTR1/
2019-01-14 23:54:02,392 dailyIndLogger INFO Searching for 10Qs in https://www.sec.gov/Archives/edgar/daily-index/2011/QTR2/
2019-01-14 23:54:02,624 dailyIndLogger INFO Searching for 10Qs in https://www.sec.gov/Archives/edgar/daily-index/2011/QTR3/
2019-01-14 23

Unnamed: 0,ACCOUNTSPAYABLEANDACCRUEDLIABILITIESCURRENT,ACCOUNTSPAYABLECURRENT,ACCOUNTSPAYABLEOTHERCURRENT,ACCOUNTSPAYABLERELATEDPARTIESCURRENT,ACCOUNTSPAYABLETRADECURRENT,ACCRUEDLIABILITIESCURRENT,BANKOVERDRAFTS,BRIDGELOAN,CAPITALLEASEOBLIGATIONSCURRENT,CAPITALLEASEOBLIGATIONSNONCURRENT,...,SENIORLONGTERMNOTES,SENIORNOTESCURRENT,SHORTTERMBANKLOANSANDNOTESPAYABLE,SHORTTERMBORROWINGS,SHORTTERMNONBANKLOANSANDNOTESPAYABLE,SUBORDINATEDDEBTCURRENT,SUBORDINATEDLONGTERMDEBT,UNSECUREDDEBTCURRENT,UNSECUREDLONGTERMDEBT,WAREHOUSEAGREEMENTBORROWINGS
0,79416000.0,66553000.0,,,,12863000.0,,,,,...,,,,,,,,,,
1,63316000.0,50451000.0,,,,12865000.0,,,,,...,,,,,,,,,,
2,9832.0,3547.0,,,,6285.0,,,,,...,,,,,,,,,,
3,10141.0,4773.0,,,,5368.0,,,,,...,,,,,,,,,,
4,2984.0,3550.0,,,,7437.0,,,,,...,,,,,,,,,,


## Section 3: Data Hierarchy

The application attempts to find relevant debt information for 71 different fields.  These fields and the accompanying taxonomy are taken from information found on https://xbrl.us/.  

Final short and long term debt levels are calculated based upon the following strategy:

  1) If values exist for key fields like ```LONGTERMDEBTNONCURRENT``` or ```DEBTCURRENT``` return these values as the final long and short-term debt levels.

  2) Else, attempt to form final results by aggregating up component subfields. 

  3) Finally, if the first two approaches fail, attempt to form results by taking values from parent fields (usually total current / non-current liabilities) and subtracting "sibling-level" fields where applicable.
    
For more details on this aggregation logic, please see the source code contained in ```edgarScraper.pipelineIO.resultset.py```.
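
As a rough illustration of the three-step strategy for short-term debt, consider the sketch below.  It is not the logic in ```resultset.py```: the function name, the particular component and sibling field groupings, and the parent field name ```LIABILITIESCURRENT``` are assumptions chosen for the example.

```python
import math

KEY_FIELD = "DEBTCURRENT"
# Assumed component and sibling groupings, for illustration only
COMPONENT_FIELDS = ["SHORTTERMBORROWINGS", "BANKOVERDRAFTS", "BRIDGELOAN"]
PARENT_FIELD = "LIABILITIESCURRENT"  # assumed name for total current liabilities
SIBLING_FIELDS = ["ACCOUNTSPAYABLECURRENT", "ACCRUEDLIABILITIESCURRENT"]

def _present(row, field):
    """True if the field was extracted and is not NaN."""
    value = row.get(field)
    return value is not None and not (isinstance(value, float) and math.isnan(value))

def resolve_short_term_debt(row):
    # 1) Prefer the key field when it was extracted.
    if _present(row, KEY_FIELD):
        return row[KEY_FIELD]
    # 2) Otherwise aggregate whichever component subfields are available.
    components = [row[f] for f in COMPONENT_FIELDS if _present(row, f)]
    if components:
        return sum(components)
    # 3) Otherwise fall back to the parent field minus sibling-level fields.
    if _present(row, PARENT_FIELD):
        siblings = sum(row[f] for f in SIBLING_FIELDS if _present(row, f))
        return row[PARENT_FIELD] - siblings
    return None  # nothing usable was extracted

from_key = resolve_short_term_debt({"DEBTCURRENT": 5.0, "SHORTTERMBORROWINGS": 3.0})
from_components = resolve_short_term_debt({"SHORTTERMBORROWINGS": 3.0, "BANKOVERDRAFTS": 1.5})
from_parent = resolve_short_term_debt(
    {"LIABILITIESCURRENT": 100.0, "ACCOUNTSPAYABLECURRENT": 60.0, "ACCRUEDLIABILITIESCURRENT": 15.0}
)
```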

As with other implementation decisions, there are pros and cons to the approach I took.  A pro is that the logic employed closely matches standard GAAP Taxonomies and allows for a robust and systematic way of determining overall debt levels.  Indeed, using this approach I was able to get viable values for close to 90% of all 10-Q filings from 1994-2018.  

**The disadvantage of this approach is that it can sometimes lead to apples-to-oranges comparisons.  For instance, for a given 10-Q, the only short-term debt field recovered may be a company's total current liabilities - either because the company provided little information or because extraction fared poorly.  This value will likely overstate the company's short-term debt (as it can include things like payroll and taxes).  For a different 10-Q, a more granular short-term debt field may be the only one resolved.  Under the scheme advanced, both values will appear as final short-term debt levels.**  
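
One possible way to make this caveat visible downstream is to record which field each final value came from, so that broad fallbacks can be separated from granular ones.  The sketch below is purely hypothetical: the application is not known to record this, and the rows, the ```BROAD_FIELDS``` set, and the ```LIABILITIESCURRENT``` field name are made up for illustration.

```python
from collections import Counter

# Made-up resolved results: (filing id, short-term debt value, source field)
resolved = [
    ("filing-A", 900000.0, "LIABILITIESCURRENT"),
    ("filing-B", 12000.0, "SHORTTERMBORROWINGS"),
    ("filing-C", 50000.0, "DEBTCURRENT"),
]

# Fields broader than debt, and therefore likely to overstate it (assumed list)
BROAD_FIELDS = {"LIABILITIESCURRENT"}

def split_by_granularity(rows):
    """Separate values taken from debt-specific fields from broad fallbacks."""
    comparable = [r for r in rows if r[2] not in BROAD_FIELDS]
    overstated = [r for r in rows if r[2] in BROAD_FIELDS]
    return comparable, overstated

comparable, overstated = split_by_granularity(resolved)
source_counts = Counter(source for _, _, source in resolved)
```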

### Data Field Groupings and Hierarchy
![title](fields.jpg)