# WRDS "last updated" metadata and parquet files

## Introduction

Recently, WRDS has made two changes to the data in its PostgreSQL database.

The first change is that several tables have been updated with the correct data types.
For example, `permno` and `permco` are now integers in `crsp.dsf`.
As a result, one no longer needs to specify `col_types={'permno': 'int32', 'permco': 'int32'}` when running `wrds_update_pq('dsf', 'crsp')`.

However, not every field has received this update.
In `crsp.ccmxpf_lnkhist`, `lpermno` and `lpermco` are not integers, so the need to specify `col_types` still applies to that table.

The second change is that WRDS has added **comments** to its tables, and each comment records when that table was last updated.
When I created `db2pq`, no information about the date of the last update was stored in the database.
To address this, in `wrds_update_pq()`, I ran a `PROC CONTENTS` on the SAS data file and extracted the "last updated" information from that.
Assuming that the SAS data and the PostgreSQL data would be in sync, I stored the SAS "last updated" information in the metadata for the parquet file created from the PostgreSQL data.
Now that similar information is stored in the WRDS tables, I no longer need to call SAS behind the scenes.

The latest version of `db2pq` (version 0.1.4) uses the PostgreSQL update information by default.
One side effect of this switch is that, if you run `wrds_update_pq()` for a parquet file you already have, it will detect the *change* in the last-updated metadata and proceed to update, even if no change in the data has occurred.
To avoid this, you can specify `use_sas=True` when calling `wrds_update_pq()`.
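
To see why the switch looks like a change, compare the two string formats that appear in the `pq_last_updated()` output later in this post. The sketch below parses both into calendar dates. It is an illustration only: `last_updated_date()` is a hypothetical helper, not part of `db2pq`, and the patterns are inferred from the strings shown here rather than from any documented WRDS format.

```python
import re
from datetime import date, datetime
from typing import Optional

# Patterns inferred from the `last_mod_str` values shown in this post
# (assumptions, not a documented format).
SAS_PATTERN = re.compile(r"Last modified: (\d{2}/\d{2}/\d{4}) \d{2}:\d{2}:\d{2}")
PG_PATTERN = re.compile(r"\(Updated (\d{4}-\d{2}-\d{2})\)")


def last_updated_date(last_mod_str: str) -> Optional[date]:
    """Return the calendar date embedded in a "last updated" string, if any."""
    m = SAS_PATTERN.search(last_mod_str)
    if m:
        return datetime.strptime(m.group(1), "%m/%d/%Y").date()
    m = PG_PATTERN.search(last_mod_str)
    if m:
        return date.fromisoformat(m.group(1))
    return None  # unrecognised or truncated string


sas_style = "Last modified: 12/27/2025 02:19:16"
pg_style = "Company (Updated 2026-01-05)"

# The raw strings from the two sources can never be equal ...
print(sas_style == pg_style)
# ... but the dates parsed from them can be compared.
print(last_updated_date(sas_style), last_updated_date(pg_style))
```

A string that has been truncated for display (ending in `...`) yields `None` here rather than a guessed date.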

Another change in version 0.1.4 of `db2pq` is the addition of the function `pq_last_updated()` to extract "last updated" metadata from a parquet data repository.
This can be useful for **data version control** for WRDS data.
WRDS does not appear to provide any archived vintages of its data sets, so it is up to users to put aside versions of WRDS data to match what was used in analysis (e.g., to ensure reproducibility of a published paper, or even to ensure that results from a few months ago hold up).
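
One simple way to put a vintage aside is to copy the whole parquet repository to a dated directory before updating it. The following is a minimal sketch using only the standard library. The function name `archive_repo()` and the `schema/table.parquet` layout used in the demonstration are assumptions made for illustration.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def archive_repo(data_dir: Path, archive_root: Path, label: str) -> Path:
    """Copy the parquet repository to archive_root/label and return the copy."""
    target = archive_root / label
    # copytree raises FileExistsError if the target already exists,
    # so an existing vintage is never silently overwritten.
    shutil.copytree(data_dir, target)
    return target


# Demonstration with throwaway directories and a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    (data_dir / "crsp").mkdir(parents=True)
    (data_dir / "crsp" / "dsf.parquet").write_bytes(b"placeholder")

    vintage = archive_repo(data_dir, Path(tmp) / "archive", date.today().isoformat())
    archived_files = sorted(
        p.relative_to(vintage).as_posix() for p in vintage.rglob("*.parquet")
    )
    print(archived_files)
```

Using the date as the label is just one choice. A label tied to a paper or project may be easier to find later.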

The WRDS web query interface is problematic from the point of view of reproducibility, as it is too easy to lose the details of the query used to produce the data and of the vintage of the underlying table.
If you're a SAS user, you can grab `dsf.sas7bdat` from the relevant directory under `crsp`.
But otherwise things have been more difficult.

In this regard, there's a lot of merit in using `wrds_update_pq()` to get WRDS data in a compact, easy-to-use form (i.e., parquet files).
Each parquet file produced by `wrds_update_pq()` comes with "last updated" metadata, previously from SAS files and now from the PostgreSQL database.

In the code below, before I update the data I downloaded recently, I apply `pq_last_updated()` to the repository, then update the repository, then apply `pq_last_updated()` again.
The results can be seen below.

If we don't have the necessary environment variables set up (e.g., `WRDS_ID` and `DATA_DIR`), we can do that here.

In [1]:
import os
os.environ["WRDS_ID"] = "iangow"
os.environ["DATA_DIR"] = "data"

Next, import the `db2pq` library.
Note that the way `db2pq` was written (by me), you need to set `WRDS_ID` before importing the library. (Sorry!)

The `db2pq` Python library offers `wrds_update_pq()`, a function that creates parquet files using data stored in the WRDS PostgreSQL database.

More discussion of the `db2pq` package can be found [here](https://iangow.github.io/far_book/parquet-wrds.html#approach-2-get-wrds-postgresql-data-using-python).

In [2]:
from db2pq import wrds_update_pq

You next need to set up your `.pgpass` file.
Instructions for doing this are found on the [WRDS website](https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-r/r-from-your-computer/) and also on [the PostgreSQL website](https://www.postgresql.org/docs/current/libpq-pgpass.html).
(Note that I do *not* recommend following the advice provided by WRDS regarding the setting-up of a `.Rprofile` file.)
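
For reference, each line of a `.pgpass` file has the form `hostname:port:database:username:password`. The sketch below builds such a line and writes it with owner-only permissions. The helper names are hypothetical, the credentials are placeholders, and the host, port, and database defaults are assumptions that should be checked against the WRDS instructions linked above.

```python
import stat
import tempfile
from pathlib import Path


def pgpass_line(user, password,
                host="wrds-pgdata.wharton.upenn.edu", port=9737, database="wrds"):
    """Format one .pgpass entry as hostname:port:database:username:password."""
    def esc(value):
        # A literal backslash or colon inside a field must be backslash-escaped.
        return str(value).replace("\\", "\\\\").replace(":", "\\:")
    return ":".join(esc(v) for v in (host, port, database, user, password))


def write_pgpass(path, line):
    """Write a single-entry password file readable only by its owner."""
    path.write_text(line + "\n")
    # On Unix, libpq ignores the file if group or others can access it.
    path.chmod(stat.S_IRUSR | stat.S_IWUSR)


line = pgpass_line("your_wrds_id", "s3cret:pw")  # placeholder credentials

# Demonstration in a throwaway directory; the real file is ~/.pgpass on Unix.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / ".pgpass"
    write_pgpass(demo_path, line)
    written = demo_path.read_text()

print(line)
```

Note that `write_pgpass()` as sketched overwrites the target file, so entries for other databases in an existing `.pgpass` would need to be preserved by hand.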

The code below is based on the script [here](https://iangow.github.io/far_book/update_table_pq_alt.py) (note that clicking on the link will download the script rather than opening it in your browser).

I start with CRSP, which takes the longest time of the three data sources used below (`crsp`, `ff`, and `comp`).
Note that the resulting repository occupies about 4.6 gigabytes of disk space (much smaller than the original PostgreSQL tables or their equivalent SAS data files). 

In [3]:
from db2pq import pq_last_updated
pq_last_updated()

| | table | schema | last_mod_str | last_mod |
|---|---|---|---|---|
| 0 | company | comp | Last modified: 12/27/2025 02:19:16 | 2025-12-27 02:19:16-05:00 |
| 1 | funda_fncd | comp | Last modified: 12/27/2025 01:40:59 | 2025-12-27 01:40:59-05:00 |
| 2 | idx_daily | comp | Last modified: 12/27/2025 01:27:36 | 2025-12-27 01:27:36-05:00 |
| 3 | funda | comp | Last modified: 12/27/2025 01:39:29 | 2025-12-27 01:39:29-05:00 |
| 4 | fundq | comp | Last modified: 12/27/2025 01:48:49 | 2025-12-27 01:48:49-05:00 |
| 5 | aco_pnfnda | comp | Last modified: 12/27/2025 01:27:44 | 2025-12-27 01:27:44-05:00 |
| 6 | r_auditors | comp | Last modified: 12/27/2025 01:25:07 | 2025-12-27 01:25:07-05:00 |
| 7 | names_seg | compseg | Last modified: 12/27/2025 13:02:08 | 2025-12-27 13:02:08-05:00 |
| 8 | seg_customer | compseg | Last modified: 12/27/2025 13:01:20 | 2025-12-27 13:01:20-05:00 |
| 9 | dseexchdates | crsp | Last modified: 01/18/2025 17:37:28 | 2025-01-18 17:37:28-05:00 |


In [4]:
# CRSP
wrds_update_pq('ccmxpf_lnkhist', 'crsp', 
               col_types={'lpermno': 'int32',
                          'lpermco': 'int32'})
wrds_update_pq('dsf', 'crsp')
wrds_update_pq('dsi', 'crsp')
wrds_update_pq('erdport1', 'crsp')
wrds_update_pq('comphist', 'crsp')
wrds_update_pq('dsedelist', 'crsp')
wrds_update_pq('dseexchdates', 'crsp')
wrds_update_pq('msf', 'crsp')
wrds_update_pq('msi', 'crsp')
wrds_update_pq('mse', 'crsp')
wrds_update_pq('stocknames', 'crsp')
wrds_update_pq('dsedist', 'crsp')

Updated crsp.ccmxpf_lnkhist is available.
Getting from WRDS.
Beginning file download at 2026-01-05 20:06:39 UTC.
Completed file download at 2026-01-05 20:06:43 UTC.

Updated crsp.dsf is available.
Getting from WRDS.
Beginning file download at 2026-01-05 20:06:43 UTC.
Completed file download at 2026-01-05 20:18:20 UTC.

Updated crsp.dsi is available.
Getting from WRDS.
Beginning file download at 2026-01-05 20:18:20 UTC.
Completed file download at 2026-01-05 20:18:24 UTC.

Updated crsp.erdport1 is available.
Getting from WRDS.
Beginning file download at 2026-01-05 20:18:24 UTC.
Completed file download at 2026-01-05 20:21:21 UTC.

Updated crsp.comphist is available.
Getting from WRDS.
Beginning file download at 2026-01-05 20:21:22 UTC.
Completed file download at 2026-01-05 20:21:31 UTC.

Updated crsp.dsedelist is available.
Getting from WRDS.
Beginning file download at 2026-01-05 20:21:31 UTC.
Completed file download at 2026-01-05 20:21:35 UTC.

Updated crsp.msf is available.
Getting from

In [5]:
# Fama-French library
wrds_update_pq('factors_daily', 'ff')

Updated ff.factors_daily is available.
Getting from WRDS.
Beginning file download at 2026-01-05 20:24:07 UTC.
Completed file download at 2026-01-05 20:24:11 UTC.



In [6]:
# Compustat
wrds_update_pq('company', 'comp')
wrds_update_pq('funda', 'comp')
wrds_update_pq('funda_fncd', 'comp')
wrds_update_pq('fundq', 'comp')
wrds_update_pq('r_auditors', 'comp')
wrds_update_pq('idx_daily', 'comp')
wrds_update_pq('aco_pnfnda', 'comp')

# compseg
wrds_update_pq('seg_customer', 'compseg')
wrds_update_pq('names_seg', 'compseg')

Updated comp.company is available.
Getting from WRDS.
Beginning file download at 2026-01-05 20:24:11 UTC.
Completed file download at 2026-01-05 20:24:15 UTC.

Updated comp.funda is available.
Getting from WRDS.
Beginning file download at 2026-01-05 20:24:16 UTC.
Completed file download at 2026-01-05 20:26:37 UTC.

Updated comp.funda_fncd is available.
Getting from WRDS.
Beginning file download at 2026-01-05 20:26:37 UTC.
Completed file download at 2026-01-05 20:27:58 UTC.

Updated comp.fundq is available.
Getting from WRDS.
Beginning file download at 2026-01-05 20:27:59 UTC.
Completed file download at 2026-01-05 20:31:13 UTC.

Updated comp.r_auditors is available.
Getting from WRDS.
Beginning file download at 2026-01-05 20:31:14 UTC.
Completed file download at 2026-01-05 20:31:17 UTC.

Updated comp.idx_daily is available.
Getting from WRDS.
Beginning file download at 2026-01-05 20:31:18 UTC.
Completed file download at 2026-01-05 20:31:52 UTC.

Updated comp.aco_pnfnda is available.
Gett

In [7]:
(pq_last_updated()).drop(columns=["last_mod"])

| | table | schema | last_mod_str |
|---|---|---|---|
| 0 | company | comp | Company (Updated 2026-01-05) |
| 1 | funda_fncd | comp | Fundamental Annual Footnote and Data Code File... |
| 2 | idx_daily | comp | Index Daily (Updated 2026-01-05) |
| 3 | funda | comp | Merged Fundamental Annual File (Updated 2026-0... |
| 4 | fundq | comp | Merged Fundamental Quarterly File (Updated 202... |
| 5 | aco_pnfnda | comp | Pension Annual Item (Updated 2026-01-05) |
| 6 | r_auditors | comp | Auditors Reference Data (Updated 2026-01-05) |
| 7 | names_seg | compseg | Name File - Segments (Updated 2026-01-05) |
| 8 | seg_customer | compseg | Segment Customer (Updated 2026-01-05) |
| 9 | dseexchdates | crsp | Last modified: 01/18/2025 17:37:28 |
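
Comparing the two snapshots by hand is feasible for ten tables, but a small helper scales better. The sketch below uses plain dictionaries keyed by `(schema, table)`, with a few values copied from the two outputs above. `changed_tables()` is a hypothetical helper, not part of `db2pq`.

```python
# Three rows copied from the "before" and "after" outputs in this post.
before = {
    ("comp", "company"): "Last modified: 12/27/2025 02:19:16",
    ("compseg", "names_seg"): "Last modified: 12/27/2025 13:02:08",
    ("crsp", "dseexchdates"): "Last modified: 01/18/2025 17:37:28",
}
after = {
    ("comp", "company"): "Company (Updated 2026-01-05)",
    ("compseg", "names_seg"): "Name File - Segments (Updated 2026-01-05)",
    ("crsp", "dseexchdates"): "Last modified: 01/18/2025 17:37:28",
}


def changed_tables(before, after):
    """Return the (schema, table) keys that are new or have a different string in `after`."""
    return sorted(key for key, value in after.items() if before.get(key) != value)


changed = changed_tables(before, after)
for schema, table in changed:
    print(f"{schema}.{table}: {before.get((schema, table))!r} -> {after[(schema, table)]!r}")
```

Given a data frame `df` returned by `pq_last_updated()`, a dictionary of this shape could be built with `dict(zip(zip(df["schema"], df["table"]), df["last_mod_str"]))`.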
