# Accessing WRDS

&copy; **Johannes Ruf** (comments welcome at j.ruf@lse.ac.uk)

There are different ways to access data in WRDS, specifically CRSP data:

1. Access via the [WRDS cloud](https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-python/python-wrds-cloud/). 
2. Access via SAS ("Statistical Analysis System"), statistical software that is well suited to accessing large datasets. Historically, this has been the standard way, and a lot of documentation is available.
3. Access via Jupyter/Spyder and SQL. Data are queried via SQL in a Jupyter notebook and returned as a `pandas` dataframe.
4. Access via the [WRDS website](https://wrds-www.wharton.upenn.edu/). Data can be downloaded as csv files (or other formats).
5. [CRSP](http://www.crsp.org/) has direct distribution channels. At university, we don't have access through these routes.

Method 3 is useful for reproducibility: one can just share the program, and other researchers (or your future self) know exactly which data were downloaded. To avoid repeated downloads, one would store the downloaded data locally. In contrast, Method 4 (the web interface) requires some steps (namely choosing the data on the interface) that are not automatically documented.
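
The idea of storing a download locally can be sketched as follows. This is a minimal sketch and not part of the `wrds` package: `load_or_download` and the file name in the usage note are hypothetical, and `fetch` stands for any function that performs the actual download.

```python
import pickle
from pathlib import Path


def load_or_download(cache_file, fetch):
    """Return the cached object if cache_file exists; otherwise call fetch() and cache the result."""
    path = Path(cache_file)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    data = fetch()                      # the only step that needs the server
    with path.open("wb") as f:
        pickle.dump(data, f)
    return data
```

Once a connection `db` has been opened as described further below, a call such as `load_or_download('dsf_sample.pkl', lambda: db.raw_sql(query))` would contact the server only the first time; deleting the file forces a fresh download.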

Disadvantages of Method 3:
 * More complicated to use, especially when downloading large datasets.
 * Data are less cleaned and need more preprocessing than the data from the web interface.
 * Requires a good internet connection while the program runs.
 * Downloads can take too long, even for moderately large datasets.
 * The server might be down.
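
One way to soften the first and fourth points is to split a large request into smaller queries, for instance one per calendar year, and to download them one at a time. The sketch below only builds the query strings; `yearly_queries` is a hypothetical helper, and the table and column names (`crsp.dsf_v2`, `permno`, `dlycaldt`, `dlyprc`) are those that appear later in this notebook.

```python
def yearly_queries(first_year, last_year, columns="permno, dlycaldt, dlyprc"):
    """Build one SQL query per calendar year for crsp.dsf_v2."""
    queries = []
    for year in range(int(first_year), int(last_year) + 1):
        queries.append(
            f"SELECT {columns} FROM crsp.dsf_v2 "
            f"WHERE dlycaldt BETWEEN '{year}-01-01' AND '{year}-12-31'"
        )
    return queries


for q in yearly_queries(2010, 2011):
    print(q)
```

Each string could then be passed to `db.raw_sql` in a loop and the resulting dataframes combined with `pd.concat`. Whether this is faster than a single query depends on the server, but a failed download then costs only one year rather than the whole request.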
 
Personally, I have mostly used Method 4, since Method 3 was introduced only around 2017/18. However, I'd advocate Method 3 whenever possible.

There are different sources of help online. Google searches often lead to answers, the WRDS website has lots of documentation and tutorials, and the [CRSP website](https://crsp.org/products/research-products/crsp-us-stock-databases) provides online documentation.

You will need to install the WRDS library using pip at the command line: `pip install wrds`
(on Mac via the Terminal and on Windows via the Anaconda Prompt app).

In [27]:
import pandas as pd
import wrds

wrds_login = 'xxx'    # update to your WRDS username

In [28]:
db = wrds.Connection(wrds_username=wrds_login)

Loading library list...
Done


The first time you run this command, you'll have to enter your password. To store it on your computer and have faster access, uncomment and run the following command.
(Note that the password is stored on your computer in plain text; hence your WRDS password should not be one that you also use elsewhere!)

In [3]:
#db.create_pgpass_file()

WRDS datasets are arranged into libraries (such as crsp or optionm[etrics]), each library containing several tables. To find out which library to use, see the WRDS webpage for their names. (You will only have access to the data that your institution subscribes to.)

In [4]:
print('List of all WRDS libraries: ', *sorted(db.list_libraries()), sep='\n')

List of all WRDS libraries: 
aha
aha_sample
ahasamp
audit
audit_audit_comp
audit_common
audit_corp_legal
audit_europe
auditsmp
bank
blab
block
boardex
boardex_eur
boardex_na
boardex_row
boardex_trial
boardex_uk
boardsmp
bvd
bvdsamp
calcbench_trial
calcbnch
cboe
centris
ciq
ciqsamp
ciqsamp_common
ciqsamp_common_new
ciqsamp_transcripts
cisdm
cisdmsmp
clrvt
clrvtsmp
columnar
comp
comp_bank
comp_bank_daily
comp_emdb_daily
comp_emdb_monthly
comp_execucomp
comp_global
comp_global_daily
comp_segments_hist
comp_segments_hist_daily
compa
compb
compbd
compdcur
compg
compgd
comph
compm
compmcur
compnad
compsamp
compsamp_snapshot
compseg
compsegd
compsnap
comscore
contrib
contrib_ceo_turnover
contrib_ceo_turnover_new
contrib_char_returns
contrib_general
contrib_general_new
contrib_intangible_value
contrib_kpss
contrib_liva
contrib_shale
crsp
crsp_a_ccm
crsp_a_indexes
crsp_a_indexes_new
crsp_a_stock
crsp_a_stock_new
crsp_a_treasuries
crsp_q_indexhist
crsp_q_mutualfunds
crspa
crspm
crspq
crspsamp
cs

In [5]:
print('List of all CRSP tables: ', *sorted(db.list_tables(library='crsp')), sep='\n')

List of all CRSP tables: 
acti
asia
asib
asic
asio
asix
bmdebt
bmheader
bmpaymts
bmquotes
bmyield
bndprt06
bndprt12
bxcalind
bxdlyind
bxmthind
bxquotes
bxyield
cap
ccm_lookup
ccm_qvards
ccmxpf_linktable
ccmxpf_lnkhist
ccmxpf_lnkrng
ccmxpf_lnkused
comphead
comphist
compmaster
contact_info
crsp_cik_map
crsp_daily_data
crsp_header
crsp_monthly_data
crsp_names
crsp_portno_map
crsp_ziman_daily_index
crsp_ziman_monthly_index
cs20yr
cs5yr
cs90d
cst_hist
daily_nav
daily_nav_ret
daily_returns
dividends
dport1
dport2
dport3
dport4
dport5
dport6
dport7
dport8
dport9
dsbc
dsbo
dse
dse62
dse62delist
dse62dist
dse62exchdates
dse62names
dse62nasdin
dse62shares
dseall
dseall62
dsedelist
dsedist
dseexchdates
dsenames
dsenasdin
dseshares
dsf
dsf62
dsf62_v2
dsf_v2
dsfhdr
dsfhdr62
dsi
dsi62
dsia
dsib
dsic
dsio
dsir
dsix
dsiy
dsp500
dsp500_v2
dsp500list
dsp500p
dssc
dsso
eod_cap
eod_sector
eod_vg
erdport1
erdport2
erdport3
erdport4
erdport5
erdport6
erdport7
erdport8
erdport9
ermport1
ermport2
ermport3
erm

Some of these tables, e.g. `dsf`, have a second version, e.g., `dsf_v2`. These new versions correspond to an update of the data format, rolled out in July 2022.
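
To see which tables come in both formats, one can compare the names within the table list. A small sketch, assuming `tables` holds the names returned by `db.list_tables(library='crsp')`; here a few names from the listing above are typed in by hand so that the snippet runs without a connection.

```python
# A hand-copied subset of the CRSP table names listed above.
tables = ['dsf', 'dsf62', 'dsf62_v2', 'dsf_v2', 'dsfhdr',
          'dsi', 'dsp500', 'dsp500_v2', 'dsp500list']

table_set = set(tables)
# Keep every table whose name also appears with the suffix '_v2'.
has_v2 = sorted(t for t in tables if t + '_v2' in table_set)
print(has_v2)
# → ['dsf', 'dsf62', 'dsp500']
```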

We can obtain descriptions of the tables as follows:

In [6]:
col_headers = db.describe_table(library='crsp', table='crsp_daily_data')

Approximately 1815202 rows in crsp.crsp_daily_data.


In [7]:
col_headers

Unnamed: 0,name,nullable,type
0,permno,True,DOUBLE_PRECISION
1,caldt,True,DATE
2,prc,True,DOUBLE_PRECISION
3,usdprc,True,DOUBLE_PRECISION
4,usdprcdt,True,DATE
5,usdprctype,True,DOUBLE_PRECISION
6,valid,True,DOUBLE_PRECISION
7,usdshr,True,DOUBLE_PRECISION
8,cap,True,DOUBLE_PRECISION
9,usdret,True,DOUBLE_PRECISION


Here is one way to get data from a specific table:

In [8]:
stocknames = db.get_table(library='crsp', table='stocknames_v2', obs=10)

In [9]:
stocknames

Unnamed: 0,permno,permco,namedt,nameenddt,securitybegdt,securityenddt,hdrcusip,hdrcusip9,cusip,cusip9,...,primaryexch,conditionaltype,tradingstatusflg,shareclass,sharetype,securitytype,securitysubtype,usincflg,issuertype,siccd
0,10000.0,7952.0,1986-01-07,1987-06-11,1986-01-07,1987-06-11,68391610,683916100,68391610,683916100,...,Q,RW,A,A,NS,EQTY,COM,Y,ACOR,3990.0
1,10001.0,7953.0,1986-01-09,1993-11-21,1986-01-09,2017-08-03,36720410,367204104,39040610,390406106,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,4920.0
2,10001.0,7953.0,1993-11-22,2008-02-04,1986-01-09,2017-08-03,36720410,367204104,29274A10,29274A105,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,4920.0
3,10001.0,7953.0,2008-02-05,2009-08-03,1986-01-09,2017-08-03,36720410,367204104,29274A20,29274A204,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,4920.0
4,10001.0,7953.0,2009-08-04,2009-12-17,1986-01-09,2017-08-03,36720410,367204104,29269V10,29269V106,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,4920.0
5,10001.0,7953.0,2009-12-18,2010-07-08,1986-01-09,2017-08-03,36720410,367204104,29269V10,29269V106,...,A,RW,A,,NS,EQTY,COM,Y,CORP,4925.0
6,10001.0,7953.0,2010-07-09,2017-08-03,1986-01-09,2017-08-03,36720410,367204104,36720410,367204104,...,A,RW,A,,NS,EQTY,COM,Y,CORP,4925.0
7,10002.0,7954.0,1986-01-10,1993-09-29,1986-01-10,2013-02-15,05978R10,05978R107,60740110,607401106,...,Q,RW,A,,NS,EQTY,COM,Y,ACOR,6710.0
8,10002.0,7954.0,1993-09-30,1999-06-30,1986-01-10,2013-02-15,05978R10,05978R107,83623410,836234104,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,6710.0
9,10002.0,7954.0,1999-07-01,2002-05-14,1986-01-10,2013-02-15,05978R10,05978R107,83623410,836234104,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,6020.0


Alternatively, we can directly use SQL:

In [10]:
db.raw_sql('SELECT * FROM crsp.stocknames_v2 LIMIT 10')

Unnamed: 0,permno,permco,namedt,nameenddt,securitybegdt,securityenddt,hdrcusip,hdrcusip9,cusip,cusip9,...,primaryexch,conditionaltype,tradingstatusflg,shareclass,sharetype,securitytype,securitysubtype,usincflg,issuertype,siccd
0,10000.0,7952.0,1986-01-07,1987-06-11,1986-01-07,1987-06-11,68391610,683916100,68391610,683916100,...,Q,RW,A,A,NS,EQTY,COM,Y,ACOR,3990.0
1,10001.0,7953.0,1986-01-09,1993-11-21,1986-01-09,2017-08-03,36720410,367204104,39040610,390406106,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,4920.0
2,10001.0,7953.0,1993-11-22,2008-02-04,1986-01-09,2017-08-03,36720410,367204104,29274A10,29274A105,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,4920.0
3,10001.0,7953.0,2008-02-05,2009-08-03,1986-01-09,2017-08-03,36720410,367204104,29274A20,29274A204,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,4920.0
4,10001.0,7953.0,2009-08-04,2009-12-17,1986-01-09,2017-08-03,36720410,367204104,29269V10,29269V106,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,4920.0
5,10001.0,7953.0,2009-12-18,2010-07-08,1986-01-09,2017-08-03,36720410,367204104,29269V10,29269V106,...,A,RW,A,,NS,EQTY,COM,Y,CORP,4925.0
6,10001.0,7953.0,2010-07-09,2017-08-03,1986-01-09,2017-08-03,36720410,367204104,36720410,367204104,...,A,RW,A,,NS,EQTY,COM,Y,CORP,4925.0
7,10002.0,7954.0,1986-01-10,1993-09-29,1986-01-10,2013-02-15,05978R10,05978R107,60740110,607401106,...,Q,RW,A,,NS,EQTY,COM,Y,ACOR,6710.0
8,10002.0,7954.0,1993-09-30,1999-06-30,1986-01-10,2013-02-15,05978R10,05978R107,83623410,836234104,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,6710.0
9,10002.0,7954.0,1999-07-01,2002-05-14,1986-01-10,2013-02-15,05978R10,05978R107,83623410,836234104,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,6020.0


It's important that, after obtaining the data, you close the connection to the WRDS database:

In [11]:
db.close()

The `pandas` dataframes are still available after closing the link to the database:

In [12]:
stocknames.head()

Unnamed: 0,permno,permco,namedt,nameenddt,securitybegdt,securityenddt,hdrcusip,hdrcusip9,cusip,cusip9,...,primaryexch,conditionaltype,tradingstatusflg,shareclass,sharetype,securitytype,securitysubtype,usincflg,issuertype,siccd
0,10000.0,7952.0,1986-01-07,1987-06-11,1986-01-07,1987-06-11,68391610,683916100,68391610,683916100,...,Q,RW,A,A,NS,EQTY,COM,Y,ACOR,3990.0
1,10001.0,7953.0,1986-01-09,1993-11-21,1986-01-09,2017-08-03,36720410,367204104,39040610,390406106,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,4920.0
2,10001.0,7953.0,1993-11-22,2008-02-04,1986-01-09,2017-08-03,36720410,367204104,29274A10,29274A105,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,4920.0
3,10001.0,7953.0,2008-02-05,2009-08-03,1986-01-09,2017-08-03,36720410,367204104,29274A20,29274A204,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,4920.0
4,10001.0,7953.0,2009-08-04,2009-12-17,1986-01-09,2017-08-03,36720410,367204104,29269V10,29269V106,...,Q,RW,A,,NS,EQTY,COM,Y,CORP,4920.0


## Example of a workflow

In [13]:
db = wrds.Connection(wrds_username=wrds_login)

Loading library list...
Done


We shall use the CRSP Daily Stock File (`dsf_v2`).

In [14]:
db.describe_table('crsp', 'dsf_v2')

Approximately 101703584 rows in crsp.dsf_v2.


Unnamed: 0,name,nullable,type
0,permno,True,DOUBLE_PRECISION
1,hdrcusip,True,VARCHAR(8)
2,permco,True,DOUBLE_PRECISION
3,siccd,True,DOUBLE_PRECISION
4,nasdissuno,True,DOUBLE_PRECISION
5,yyyymmdd,True,DOUBLE_PRECISION
6,dlycaldt,True,DATE
7,dlydelflg,True,VARCHAR(1)
8,dlyprc,True,DOUBLE_PRECISION
9,dlyprcflg,True,VARCHAR(2)


In [15]:
db.get_table('crsp', 'dsf_v2', obs=100)

Unnamed: 0,permno,hdrcusip,permco,siccd,nasdissuno,yyyymmdd,dlycaldt,dlydelflg,dlyprc,dlyprcflg,...,dlybid,dlyask,dlyopen,dlynumtrd,dlymmcnt,dlyprcvol,disfacpr,disfacshr,disexdt,shrout
0,10000.0,68391610,7952.0,3990.0,10396.0,19860107.0,1986-01-07,N,2.56250,BA,...,2.3750,2.7500,,,9.0,2562.500,,,,3680.0
1,10000.0,68391610,7952.0,3990.0,10396.0,19860108.0,1986-01-08,N,2.50000,BA,...,2.3750,2.6250,,,9.0,32000.000,,,,3680.0
2,10000.0,68391610,7952.0,3990.0,10396.0,19860109.0,1986-01-09,N,2.50000,BA,...,2.3750,2.6250,,,9.0,3500.000,,,,3680.0
3,10000.0,68391610,7952.0,3990.0,10396.0,19860110.0,1986-01-10,N,2.50000,BA,...,2.3750,2.6250,,,10.0,21250.000,,,,3680.0
4,10000.0,68391610,7952.0,3990.0,10396.0,19860113.0,1986-01-13,N,2.62500,BA,...,2.5000,2.7500,,,10.0,14306.250,,,,3680.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,10000.0,68391610,7952.0,3990.0,10396.0,19860521.0,1986-05-21,N,3.65625,BA,...,3.5000,3.8125,,,8.0,68737.500,,,,3793.0
96,10000.0,68391610,7952.0,3990.0,10396.0,19860522.0,1986-05-22,N,3.43750,BA,...,3.3750,3.5000,,,8.0,13750.000,,,,3793.0
97,10000.0,68391610,7952.0,3990.0,10396.0,19860523.0,1986-05-23,N,3.46875,BA,...,3.3125,3.6250,,,7.0,3815.625,,,,3793.0
98,10000.0,68391610,7952.0,3990.0,10396.0,19860527.0,1986-05-27,N,3.46875,BA,...,3.3125,3.6250,,,7.0,12834.375,,,,3793.0


Here is how we select only a few columns with SQL, along with a `WHERE` clause that filters the rows:

In [16]:
db.raw_sql("SELECT hdrcusip, permno, dlycaldt, dlylow, dlyhigh "
           "FROM crsp.dsf_v2 "
           "WHERE permno IN (14593, 90319, 12490, 17778) "
           "AND dlycaldt='2013-01-04'")

Unnamed: 0,hdrcusip,permno,dlycaldt,dlylow,dlyhigh
0,45920010,12490.0,2013-01-04,192.78,194.46
1,03783310,14593.0,2013-01-04,525.8286,538.6299
2,08467010,17778.0,2013-01-04,140047.0,141003.8
3,02079K30,90319.0,2013-01-04,727.6801,741.47


The query can also be written as a triple-quoted string that spans several lines:

In [17]:
db.raw_sql("""
    SELECT hdrcusip, permno, dlycaldt, dlylow, dlyhigh 
    FROM crsp.dsf_v2 
    WHERE permno IN (14593, 90319, 12490, 17778) 
    AND dlycaldt='2013-01-04'
    """)

Unnamed: 0,hdrcusip,permno,dlycaldt,dlylow,dlyhigh
0,45920010,12490.0,2013-01-04,192.78,194.46
1,03783310,14593.0,2013-01-04,525.8286,538.6299
2,08467010,17778.0,2013-01-04,140047.0,141003.8
3,02079K30,90319.0,2013-01-04,727.6801,741.47


Let's check when the daily high price is over $2,000 between the years 2010 and 2013 for these four companies:

In [18]:
db.raw_sql("""
    SELECT hdrcusip, permno, dlycaldt, dlylow, dlyhigh
    FROM crsp.dsf_v2 
    WHERE permno IN (14593, 90319, 12490, 17778) 
    AND dlycaldt BETWEEN '2010-01-01' AND '2013-12-31' AND dlyhigh > 2000
    """)

Unnamed: 0,hdrcusip,permno,dlycaldt,dlylow,dlyhigh
0,08467010,17778.0,2010-01-04,99201.0,99910.0
1,08467010,17778.0,2010-01-05,99550.0,100001.0
2,08467010,17778.0,2010-01-06,99500.0,100000.0
3,08467010,17778.0,2010-01-07,99594.0,100000.0
4,08467010,17778.0,2010-01-08,99700.0,100300.0
...,...,...,...,...,...
1001,08467010,17778.0,2013-12-24,175555.1,176063.0
1002,08467010,17778.0,2013-12-26,175610.0,176900.0
1003,08467010,17778.0,2013-12-27,176678.0,177320.0
1004,08467010,17778.0,2013-12-30,176655.0,177685.0


All companies (identified by `permno`) that ever had a daily high price above 2000:

In [19]:
db.raw_sql('SELECT DISTINCT permno FROM crsp.dsf_v2 WHERE dlyhigh > 2000')

Unnamed: 0,permno
0,14542.0
1,14752.0
2,15395.0
3,16280.0
4,17486.0
5,17778.0
6,17881.0
7,21709.0
8,36281.0
9,76605.0


Let's now check which companies these are, by joining different tables.

In [20]:
db.raw_sql("""
    SELECT DISTINCT crsp.dsf_v2.permno, crsp.stocknames_v2.issuernm
    FROM crsp.dsf_v2 
    JOIN
    crsp.stocknames_v2
    ON crsp.dsf_v2.permno = crsp.stocknames_v2.permno
    WHERE crsp.dsf_v2.dlyhigh > 2000
    """)

Unnamed: 0,permno,issuernm
0,14542.0,ALPHABET INC
1,14542.0,GOOGLE INC
2,14752.0,TEXAS PACIFIC LAND TRUST
3,15395.0,CABLE ONE INC
4,16280.0,KELSEY HAYES CO
5,16280.0,KELSEY HAYES WHEEL CORP
6,16280.0,KELSEY HAYES WHEEL INC
7,16280.0,KELSEY WHEEL CO
8,17486.0,SPENCER KELLOGG & SONS INC
9,17778.0,BERKSHIRE HATHAWAY INC DEL
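
The join above returns every name a `permno` has ever carried, which is why 16280 appears four times. To pick only the name that was valid on the trading day, the join condition can also compare the date with `namedt` and `nameenddt`. The sketch below illustrates this on an in-memory SQLite database rather than on WRDS, so that it runs without a connection: the table names `dsf` and `names` are stand-ins, the four name rows are copied from what `stocknames_v2` reports for permno 16280, and the single price row is made up purely for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dsf (permno INTEGER, dlycaldt TEXT, dlyhigh REAL);
    CREATE TABLE names (permno INTEGER, namedt TEXT, nameenddt TEXT, issuernm TEXT);
""")
# Name history rows copied from stocknames_v2 for permno 16280.
con.executemany("INSERT INTO names VALUES (?, ?, ?, ?)", [
    (16280, "1925-12-31", "1927-06-29", "KELSEY WHEEL CO"),
    (16280, "1927-06-30", "1933-02-08", "KELSEY HAYES WHEEL CORP"),
    (16280, "1933-02-09", "1956-12-19", "KELSEY HAYES WHEEL INC"),
    (16280, "1956-12-20", "1962-07-01", "KELSEY HAYES CO"),
])
# Hypothetical price row (NOT real CRSP data), only there to have something to join.
con.execute("INSERT INTO dsf VALUES (16280, '1929-01-02', 2100.0)")

# Join on permno only: one row per name the security ever had.
all_names = con.execute("""
    SELECT DISTINCT d.permno, n.issuernm
    FROM dsf AS d
    JOIN names AS n ON d.permno = n.permno
    WHERE d.dlyhigh > 2000
""").fetchall()

# Join on permno and date range: only the name valid on the trading day.
dated_names = con.execute("""
    SELECT DISTINCT d.permno, n.issuernm
    FROM dsf AS d
    JOIN names AS n
      ON d.permno = n.permno
     AND d.dlycaldt BETWEEN n.namedt AND n.nameenddt
    WHERE d.dlyhigh > 2000
""").fetchall()

print(len(all_names), dated_names)
con.close()
```

The same join condition should carry over to `crsp.dsf_v2` and `crsp.stocknames_v2`, although the sketch itself only exercises SQLite. The short aliases `d` and `n` also save repeating the full table names.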


If a query depends on variables, we could use SQL variables. An alternative that I prefer is to insert Python variables directly into the query string, as follows.

In [21]:
y = 16280

In [22]:
pd.set_option('display.max_columns', None)    # just to display all columns

In [23]:
db.raw_sql("SELECT * FROM crsp.stocknames_v2 WHERE permno = '{}'".format(y))

Unnamed: 0,permno,permco,namedt,nameenddt,securitybegdt,securityenddt,hdrcusip,hdrcusip9,cusip,cusip9,ticker,issuernm,primaryexch,conditionaltype,tradingstatusflg,shareclass,sharetype,securitytype,securitysubtype,usincflg,issuertype,siccd
0,16280.0,22677.0,1925-12-31,1927-06-29,1925-12-31,1973-10-31,48818810,488188103,,,,KELSEY WHEEL CO,N,RW,A,,NS,EQTY,COM,Y,ACOR,3710.0
1,16280.0,22677.0,1927-06-30,1933-02-08,1925-12-31,1973-10-31,48818810,488188103,,,,KELSEY HAYES WHEEL CORP,N,RW,A,,NS,EQTY,COM,Y,ACOR,3710.0
2,16280.0,22677.0,1933-02-09,1956-12-19,1925-12-31,1973-10-31,48818810,488188103,,,,KELSEY HAYES WHEEL INC,N,RW,A,B,NS,EQTY,COM,Y,ACOR,3710.0
3,16280.0,22677.0,1956-12-20,1962-07-01,1925-12-31,1973-10-31,48818810,488188103,,,,KELSEY HAYES CO,N,RW,A,,NS,EQTY,COM,Y,ACOR,3710.0
4,16280.0,22677.0,1962-07-02,1968-01-01,1925-12-31,1973-10-31,48818810,488188103,,,KW,KELSEY HAYES CO,N,RW,A,,NS,EQTY,COM,Y,ACOR,3714.0
5,16280.0,22677.0,1968-01-02,1973-10-31,1925-12-31,1973-10-31,48818810,488188103,48818810.0,488188103.0,KW,KELSEY HAYES CO,N,RW,A,,NS,EQTY,COM,Y,ACOR,3714.0
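
When the variable is a list of identifiers rather than a single number, the same approach works after turning the list into a comma-separated string. A minimal sketch, with `permno_query` a hypothetical helper; converting each entry with `int` makes sure that only numbers end up in the query text.

```python
def permno_query(permnos, table="crsp.stocknames_v2"):
    """Build a SELECT statement restricted to the given permnos."""
    if not permnos:
        raise ValueError("permnos must contain at least one identifier")
    # int() raises ValueError for anything that is not a number.
    id_list = ", ".join(str(int(p)) for p in permnos)
    return f"SELECT * FROM {table} WHERE permno IN ({id_list})"


query = permno_query([14593, 90319, 12490, 17778])
print(query)
# → SELECT * FROM crsp.stocknames_v2 WHERE permno IN (14593, 90319, 12490, 17778)
```

The resulting string can be passed to `db.raw_sql(query)` while the connection is open.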


In [24]:
pd.reset_option('display.max_columns')     # reset to the default number of displayed columns

Again, let's not forget to close the link to the database:

In [25]:
db.close()
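
To make sure the connection is closed even if a query fails halfway, the `close()` call can be delegated to `contextlib.closing` from the standard library, which calls `close()` on its argument when the `with` block ends. The sketch uses a small stand-in class instead of `wrds.Connection` so that it runs without credentials; `FakeConnection` is hypothetical.

```python
from contextlib import closing


class FakeConnection:
    """Stand-in for a database connection; only tracks whether close() was called."""

    def __init__(self):
        self.is_open = True

    def raw_sql(self, query):
        return f"result of: {query}"

    def close(self):
        self.is_open = False


conn = FakeConnection()
with closing(conn) as db:
    result = db.raw_sql("SELECT 1")

print(conn.is_open)   # → False
```

With the real library one would write `with closing(wrds.Connection(wrds_username=wrds_login)) as db:` and run the queries inside the block.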

## Exercises

 1. Find out the permno that had a stock price over 2000 for the most trading days.
 2. Practise / learn some SQL via the [murder mystery](https://mystery.knightlab.com/).

CRSP = Center for Research in Security Prices.
Its price series begin on December 31, 1925.