MD&A sections from 10-Ks; 2002-2018

MD&A text from 10-K filings

This repository contains an "index" file that can be used to process the raw MD&A data parsed using this code, which is a fork/update of this repo. After following the instructions below, you can have your own panel database of public firm MD&A text.

As an example of how this data can be used, Ewens, Peters and Wang (2019) searched all the MD&A sections for references to personnel or human capital risk (following Eisfeldt and Papanikolaou, 2013) to sort public firms and compare their organizational capital stocks:

[Figure: Org. stock sorts]

If you don't want to run the scraper code yourself (this or this), which takes days, you can follow the instructions below to get the raw MD&A data gathered for Ewens, Peters and Wang (2019). The basic idea is to keep the already-scraped MD&A sections as local text files and grab the ones that you need using the index file and your programming language of choice. The resulting data will have the fields (File, CompanyName, CIK, SIC, ReportDate, Section, Extracted_info).

Note that this repository does not actually have the txt files of MD&A text. See the instructions below for the temporary solution to that problem.

Please cite Ewens, Peters and Wang (2019), "Acquisition prices and the measurement of intangible capital", if you use the data, and Berns, Bick, Flugum and Houston if you use the scraper. Make sure to check the license rules for both.

How to use the index file to build your data

  • load the txt index file. This file is a simple list of the file names of MD&A text from the original download. Note that you could probably just get the full set of statements, save the output of ls or similar to a variable and loop over the files.
  • the column Filer is a mapping to the raw text files associated with the 10-K scrape. So for row 6 in that file, the respective txt file is 119173.txt
  • download the txt files from here (they are too big for Github). For space reasons, the files were split into folders. They will have to be combined into one folder if the code below is to work.
  • We ask that this data not be forwarded on to others. This ensures that everyone receives the latest version of the data that's available.
  • use the Python script, or your code of choice, to loop through the text files and grab what you need. The text files have headers (unfortunately, sometimes more than one per file, because the data are messy) like this:
CIK: 0001126956
SIC: 4924
FILE DATE: 20161115

So with the txt file loaded, all the identifying information is available.
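
As a minimal sketch of how that identifying information could be pulled out of a loaded file, the snippet below reads the three header fields shown above with a regular expression. The function name parse_header and the sample string are hypothetical; this is not the repository's own script. Because a file may repeat its header, the sketch keeps only the first value it sees for each field, and it leaves values as strings so that leading zeros in the CIK survive.

```python
import re

# Header lines look like "CIK: 0001126956"; the field names come from the
# example header above. Anything else about the file layout is an assumption.
HEADER_RE = re.compile(r"^(CIK|SIC|FILE DATE):[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def parse_header(text):
    """Return the first CIK, SIC and FILE DATE found in one MD&A text file."""
    fields = {}
    for name, value in HEADER_RE.findall(text):
        # Some files repeat the header; keep the first occurrence of each field.
        fields.setdefault(name, value)
    return fields


# Made-up sample with a repeated header line, for illustration only.
sample = (
    "CIK: 0001126956\n"
    "SIC: 4924\n"
    "FILE DATE: 20161115\n"
    "\n"
    "The following discussion ...\n"
    "CIK: 0001126956\n"
)
print(parse_header(sample))
```

In practice you would call parse_header on the contents of each txt file, for example after reading it with open(path, 'r').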

Do file example

* Change to the folder that holds the index file and the text files (adjust the path)
cd ~/Link/To/MDNA_text/

* Load the index (you have downloaded it already)
insheet using "download_log_filelist.txt", tab clear

* Filenames (this is the "trick")
tostring filer, force replace
gen filename = filer + ".txt"

  * Now you have a list of file names associated with each row of interest

global importfolder "~/Link/To/MDNA_text/files"
global filelist : dir "$importfolder" files "*.txt"
* Now we have the file list to do your magic on
* Regex to grab CIK / Date / etc (the Python code below has examples)
* You could import each filename in a loop and save as a variable... that can get big VERY fast, so be careful
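
For readers who prefer Python, the same "trick" from the do file can be sketched with the standard library alone. This assumes the index is tab-delimited with a header row containing a Filer column, as the do file and the instructions above suggest. The function names and the HasSection column in the demo are hypothetical.

```python
import csv
import io
from pathlib import Path


def index_filenames(index_file, column="Filer"):
    """Yield '<Filer>.txt' for every row of the tab-delimited index file."""
    reader = csv.DictReader(index_file, delimiter="\t")
    for row in reader:
        yield row[column].strip() + ".txt"


def existing_paths(filenames, folder):
    """Keep only the file names that are actually present in the folder."""
    folder = Path(folder).expanduser()
    return [folder / name for name in filenames if (folder / name).is_file()]


# Made-up two-row index for illustration; the real column names may differ.
demo_index = io.StringIO("Filer\tHasSection\n119173\t1\n119174\t0\n")
names = list(index_filenames(demo_index))
print(names)
```

With the real data you would pass an open handle on download_log_filelist.txt to index_filenames and then call existing_paths(names, "~/Link/To/MDNA_text/files") to get the paths to loop over.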

Python code

The original scraper has an "MDA Cleaner and Tone" script that can be modified to your liking. The main pieces are with open(download, 'r') as txtfile: and the NEGATIVE or POSITIVE definitions. This repository also has the Python script for Ewens, Peters and Wang (2019), which counts words rather than measuring tone.

Here are the keywords that we searched for:

sayings = ["the following discussion", "this discussion and analysis", "should be read in conjunction", "should be read together with", "the following managements discussion and analysis"]
personnel = ["key personnel", "personnel","talented employee", "key talent"]
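
To illustrate how a keyword list like personnel might be turned into counts, here is a minimal sketch. It is not the script used in the paper: count_keywords and the sample sentence are hypothetical, and matching is a plain case-insensitive substring count. Note that "personnel" also matches inside "key personnel", so the counts overlap.

```python
personnel = ["key personnel", "personnel", "talented employee", "key talent"]


def count_keywords(text, keywords):
    """Count case-insensitive substring occurrences of each keyword in text."""
    lowered = text.lower()
    return {kw: lowered.count(kw) for kw in keywords}


# Made-up sentence for illustration only.
sample = (
    "We depend on key personnel. "
    "Losing talented employees or other personnel could hurt us."
)
counts = count_keywords(sample, personnel)
print(counts)
# -> {'key personnel': 1, 'personnel': 2, 'talented employee': 1, 'key talent': 0}
```

If overlapping matches are a problem for your analysis, one option is to subtract the "key personnel" count from the "personnel" count.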

Bibtex citation

 title={Acquisition prices and the measurement of intangible capital},
 author={Ewens, Michael and Peters, Ryan and Wang, Sean},
 journal={Working Paper}