Data for accounting research
This repository contains code to pull together data from various sources including:
Note that some of the data sets I use are proprietary, so the code will only work if you have access to the data in some form.
While Git is not strictly necessary to use the scripts here, it likely makes the code easier to download and to update.
I keep all Git repositories in
~/git. So to get this repository, I could do:
    cd ~/git
    git clone https://github.com/iangow/acct_data.git
This will create a copy of the repository in
Note that one can get updates to the repository by going to the directory and "pulling" the latest code:
    cd ~/git/acct_data
    git pull
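The clone and pull steps above can be folded into one helper that clones on first use and pulls afterwards. This is only a minimal sketch and not part of the repository: the function names `repo_action` and `sync_repo` are hypothetical, and it assumes `git` is on the path.

```shell
#!/bin/sh
# Hypothetical helpers (not part of this repository).

# Print "pull" if DIR already holds a Git working copy, otherwise "clone".
repo_action() {
    if [ -d "$1/.git" ]; then
        echo pull
    else
        echo clone
    fi
}

# Clone URL into DIR on first use; pull the latest code on later runs.
sync_repo() {
    url=$1
    dir=$2
    if [ "$(repo_action "$dir")" = pull ]; then
        git -C "$dir" pull
    else
        mkdir -p "$(dirname "$dir")"
        git clone "$url" "$dir"
    fi
}

# Example (uncomment to run; needs network access):
# sync_repo https://github.com/iangow/acct_data.git "$HOME/git/acct_data"
```

Keeping the decision in `repo_action` separate from the Git calls makes it possible to check the logic without touching the network.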
Alternatively, I think you could fork the repository on GitHub and then clone.
I think that cloning using the SSH URL (e.g.,
git@github.com:iangow/acct_data.git) is necessary for Git pulling and pushing to work well in RStudio.
Many of the scripts rely on Perl (I use MacPorts, which I think currently defaults to v5.16).
In addition, the Perl scripts generally interact with PostgreSQL using the Perl module DBD::Pg.
I use MacPorts to install this:
    sudo port install p5-dbd-pg
On systems that use apt-get (e.g., Debian or Ubuntu), the following should work:
    sudo apt-get install libdbi-perl libdbd-pg-perl
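After installing, it may be worth confirming that Perl can actually load the modules. The sketch below uses the usual `perl -MModule -e 1` idiom; the function name `perl_has_module` is hypothetical and not something the repository provides.

```shell
#!/bin/sh
# Hypothetical helper: succeed if Perl can load the named module.
# Fails if the module is missing or if perl itself is not installed.
perl_has_module() {
    perl -M"$1" -e 1 2>/dev/null
}

# Report the status of the modules the Perl scripts depend on.
for module in DBI DBD::Pg; do
    if perl_has_module "$module"; then
        echo "ok: $module"
    else
        echo "missing: $module"
    fi
done
```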
You should have a PostgreSQL database to store the data.
There are also some data dependencies in that some scripts assume the existence of other data in the database.
For example, scripts that download filings generally refer to the PostgreSQL table
filings.filings created by the script get_filings.R.
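A script with such a dependency could check for the table before doing any work. The following is a sketch under stated assumptions: the function name `table_exists` is hypothetical, the table name is assumed to be trusted input (it is pasted into the SQL unescaped), and `to_regclass()` requires PostgreSQL 9.4 or later. `psql` takes its connection settings from the standard `PGHOST`, `PGDATABASE` and `PGUSER` environment variables.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the schema-qualified table exists.
# Returns 2 if psql itself fails (e.g., cannot connect).
table_exists() {
    # -X: skip ~/.psqlrc; -A -t: print the bare value; -c: run one command.
    result=$(psql -X -A -t -c "SELECT to_regclass('$1') IS NOT NULL") || return 2
    [ "$result" = t ]
}

# Example (uncomment to run against your database):
# if ! table_exists filings.filings; then
#     echo "filings.filings not found; run get_filings.R first" >&2
#     exit 1
# fi
```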
A number of scripts here are Bash shell scripts. These should work on Linux or OS X, but not on Windows (unless you have Cygwin or something like it).
I also assume that
psql (command-line interface to PostgreSQL) is on the path.
I have MacPorts on my path (in
~/.profile I set
export PATH=/opt/local/bin:/opt/local/sbin:$PATH) and I can ensure that PostgreSQL is on my path by setting
sudo port select postgresql postgresql94 (v9.4 being current at the time of writing).
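A quick way to confirm that the path is set up as the scripts expect is to look each tool up with `command -v`. This is a minimal sketch; the function name `missing_tools` and the particular list of tools checked are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each tool that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools psql perl git)
if [ -n "$missing" ]; then
    # $missing is left unquoted so the names print on one line.
    echo "Not on PATH:" $missing >&2
fi
```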
6. Environment variables
I am migrating the scripts, etc., from using hard-coded values (e.g., my WRDS ID
iangow) to using environment variables.
Environment variables that I use include:
PGDATABASE: The name of the PostgreSQL database you use.
PGUSER: Your username on the PostgreSQL database.
PGHOST: Where the PostgreSQL database is to be found (this will be
localhost if it's on the same machine as the one you're running the code on)
EDGAR_DIR: The local location of a partial mirror of EDGAR.
PGBACKUP_DIR: The directory where backups of PostgreSQL data created by
I set these environment variables in
    export PGHOST="localhost"
    export PGDATABASE="mydb"
    export EDGAR_DIR="/Volumes/2TB/data"
    export PGUSER="iangow"
    # Use $HOME here: the shell does not expand ~ inside double quotes.
    export PGBACKUP_DIR="$HOME/Dropbox/pg_backup/"
I also set them in
~/.Rprofile, as RStudio doesn't seem to pick up the settings in
~/.profile in recent versions of OS X:
    Sys.setenv(EDGAR_DIR="/Volumes/2TB/data")
    Sys.setenv(PGHOST="localhost")
    Sys.setenv(PGDATABASE="mydb")
    Sys.setenv(PGBACKUP_DIR="~/Dropbox/pg_backup/")
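Because the scripts read these settings from the environment, a shell script can check up front that they are all defined. The sketch below is one way to do that; the function name `unset_vars` is hypothetical, and it relies on `printenv`, which is available on Linux and OS X.

```shell
#!/bin/sh
# Hypothetical helper: print each name that is unset (or empty) in the environment.
unset_vars() {
    for name in "$@"; do
        [ -n "$(printenv "$name")" ] || echo "$name"
    done
}

missing=$(unset_vars PGDATABASE PGUSER PGHOST EDGAR_DIR PGBACKUP_DIR)
if [ -n "$missing" ]; then
    # $missing is left unquoted so the names print on one line.
    echo "Unset environment variables:" $missing >&2
fi
```

Only exported variables are visible to `printenv`, which matches what child processes such as `psql` or Perl scripts would see.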