# finsets

> Download and process datasets commonly used in finance research

Each module handles a different data source. Almost all submodules (other than utility ones) have a `download` function that downloads the raw data and a `clean` function that processes the data into a `pandas.DataFrame` having, as index, either:

- A `pandas.Period` date reflecting the frequency of the data (for time-series datasets), or
- A `pandas.MultiIndex` with a panel identifier in the first dimension and a `pandas.Period` date in the second dimension (for panel datasets).

The period date in the index will be named following the pattern `Xdate` where X is the string literal representing the frequency of the data (e.g. `Mdate` for monthly data, `Qdate` for quarterly data, `Ydate` for annual data).
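As an illustration of the two index layouts, the following sketch builds them by hand with pandas. It does not call any `finsets` function; the column name `value` and the panel identifier name `permno` are placeholders chosen for this example.

```python
import pandas as pd

# Time-series layout: a monthly PeriodIndex named "Mdate"
ts = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=pd.period_range("2020-01", periods=3, freq="M", name="Mdate"),
)

# Panel layout: (identifier, period) MultiIndex.
# "permno" is a placeholder identifier name, not something finsets guarantees.
panel = pd.DataFrame(
    {"value": [10.0, 11.0, 20.0, 21.0]},
    index=pd.MultiIndex.from_product(
        [[10001, 10002], pd.period_range("2020Q1", periods=2, freq="Q")],
        names=["permno", "Qdate"],
    ),
)

print(ts.index.name)             # name of the period index
print(list(panel.index.names))   # identifier first, period date second
```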

[Documentation site](https://ionmihai.github.io/finsets/).

[GitHub page](https://github.com/ionmihai/finsets).

## Install

```sh
pip install finsets
```

## How to use

```python
import finsets as fds
```

or

```python
from finsets import fred, wrds, papers, metadata
```

Below, we very briefly describe each submodule. For more details, please see the documentation of each submodule (they provide a lot more functionality than presented here).

## Core

> Functions that are not specific to a particular data source. 

The functions in this module are available directly in the `finsets` namespace. For example:

```python
metadata.features_metadata().head()
```

|   | name | label | output_of | inputs | inputs_generated_by |
|---|------|-------|-----------|--------|---------------------|
| 0 | bookeq | Book equity | wrds.compa.book_equity | at,lt,seq,ceq,txditc,pstk,pstkrv,pstkl,itcb | wrds.compa.clean |
| 1 | shreq | Shareholder equity | wrds.compa.book_equity | at,lt,seq,ceq,txditc,pstk,pstkrv,pstkl,itcb | wrds.compa.clean |
| 2 | pref_stock | Preferred stock | wrds.compa.book_equity | at,lt,seq,ceq,txditc,pstk,pstkrv,pstkl,itcb | wrds.compa.clean |
| 3 | tobinq | Tobin Q | wrds.compa.tobin_q | at,lt,seq,ceq,txditc,pstk,pstkrv,pstkl,itcb,pr... | wrds.compa.clean |
| 4 | equityiss_tot | Equity issuance | wrds.compa.issuance_vars | at,lt,seq,ceq,txditc,pstk,pstkrv,pstkl,itcb,ss... | wrds.compa.clean |


```python
metadata.search('total assets').head()
```

| SCORE | NAME | LABEL | OUTPUT_OF | INPUTS | INPUTS_GENERATED_BY | TYPE | NR_ROWS | WRDS_LIBRARY | WRDS_TABLE | GROUP |
|-------|------|-------|-----------|--------|---------------------|------|---------|--------------|------------|-------|
| 100 | at | Assets - Total | wrds.compa.download | | | DOUBLE_PRECISION | 881927 | comp | funda | |
| 75 | act | Current Assets - Total | wrds.compa.download | | | DOUBLE_PRECISION | 881927 | comp | funda | |
| 75 | ao | Assets - Other | wrds.compa.download | | | DOUBLE_PRECISION | 881927 | comp | funda | |
| 71 | batr | Benefits Assumed - Total | wrds.compa.download | | | DOUBLE_PRECISION | 881927 | comp | funda | |
| 69 | rvti | Reserves - Total | wrds.compa.download | | | DOUBLE_PRECISION | 881927 | comp | funda | |


## WRDS

> Downloads and processes datasets from Wharton Research Data Services ([WRDS](https://wrds-www.wharton.upenn.edu/)).

Each WRDS module handles a different library in WRDS (e.g. `compa` module for the Compustat Annual CCM file, `crspm` for the CRSP Monthly Stock file, etc.).


Before you use any of the `wrds` modules, you need to create a `pgpass` file with your WRDS credentials. To do that, run

```python
from finsets.wrds import wrds_api

db = wrds_api.Connection()
```

This will prompt you for your WRDS username and password. After you enter your credentials, if you don't already have a `pgpass` file set up, you will be asked whether you want to create one. Hit `y` and it will be created for you automatically. After this, you will not have to enter your WRDS password again.

You will still have to supply your WRDS username to functions that retrieve data from WRDS (all of them have a `wrds_username` parameter). If you don't want to be prompted for the username for every download, save it under a `WRDS_USERNAME` environment variable:

- On Windows, in a Command Prompt: 
    - ```setx WRDS_USERNAME "your_wrds_username_here"```
- On Linux, in a terminal: 
    - `echo 'export WRDS_USERNAME="your_wrds_username_here"' >> ~/.bashrc && source ~/.bashrc`
- On macOS, since macOS Catalina:
    - `echo 'export WRDS_USERNAME="your_wrds_username_here"' >> ~/.zshrc && source ~/.zshrc`
- On macOS, prior to macOS Catalina:
    - `echo 'export WRDS_USERNAME="your_wrds_username_here"' >> ~/.bash_profile && source ~/.bash_profile`
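
To confirm from Python that the variable is visible to your session (you may need to open a new terminal first), a check along these lines can help. `get_wrds_username` is a hypothetical helper written for this example, not a `finsets` function, and it makes no claim about how `finsets` reads the variable internally.

```python
import os

def get_wrds_username(default=None):
    """Return the WRDS_USERNAME environment variable, or `default` if it is unset.

    Hypothetical helper for illustration only; not part of finsets.
    """
    return os.environ.get("WRDS_USERNAME", default)

# Prints the stored username, or a placeholder if the variable is not set
print(get_wrds_username(default="<not set>"))
```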


The functions in the `wrds` modules close their database connections to WRDS automatically. However, if you open a connection manually, as above (with `wrds_api.Connection()`), make sure you close it when you are done. In our example above:

```python
db.close()
```
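
If you want the connection closed even when a query raises an error, one option is the standard library's `contextlib.closing`. The sketch below uses a stand-in class, `FakeConnection` (hypothetical, defined only for this example), because a real WRDS connection needs credentials; it assumes only that the connection object has a `close()` method, as in the call above.

```python
from contextlib import closing

class FakeConnection:
    """Stand-in for a WRDS connection; it only models close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

conn = FakeConnection()
try:
    with closing(conn) as db:
        # ... queries using `db` would go here ...
        raise RuntimeError("simulated query failure")
except RuntimeError:
    pass

print(conn.closed)  # True: close() ran even though the block raised
```

With a real connection, the same pattern would presumably read `with closing(wrds_api.Connection()) as db:`.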

Check the `wrds_utils` module for an introduction to some of the main utilities that come with the `wrds` package.

## FRED

> Downloads and processes datasets from the St. Louis [FRED](https://fred.stlouisfed.org/).

To use the functions in the `fred` module, you'll need an API key from the St. Louis FRED. 

Get one [here](https://fred.stlouisfed.org/docs/api/api_key.html) and store it in your environment variables under the name `FRED_API_KEY`.

Alternatively, you can supply the API key directly as the `api_key` parameter in each function in the `fred` module.

```python
fred.clean(vars = ['GDP'])
```

```
{'Q':            dtdate    nom_gdp
 Qdate                       
 1947Q1 1947-01-01    243.164
 1947Q2 1947-04-01    245.968
 1947Q3 1947-07-01    249.585
 1947Q4 1947-10-01    259.745
 1948Q1 1948-01-01    265.742
 ...           ...        ...
 2022Q3 2022-07-01  25994.639
 2022Q4 2022-10-01  26408.405
 2023Q1 2023-01-01  26813.601
 2023Q2 2023-04-01  27063.012
 2023Q3 2023-07-01  27623.543
 
 [307 rows x 2 columns]}
```
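
The output above is a dictionary keyed by frequency (`'Q'` here), with one `DataFrame` per key. The sketch below shows one way to work with such a result. To stay runnable without an API key, it rebuilds a small stand-in dictionary from the first four rows printed above rather than calling `fred.clean`; the variable names `result`, `gdp`, and `growth` are chosen for this example.

```python
import pandas as pd

# Stand-in for the dictionary shown above, rebuilt from its first four rows
idx = pd.period_range("1947Q1", periods=4, freq="Q", name="Qdate")
result = {
    "Q": pd.DataFrame(
        {
            "dtdate": idx.to_timestamp(),  # first day of each quarter
            "nom_gdp": [243.164, 245.968, 249.585, 259.745],
        },
        index=idx,
    )
}

gdp = result["Q"]                     # select the quarterly frame
growth = gdp["nom_gdp"].pct_change()  # quarter-over-quarter growth rate
print(growth.round(4))
```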

## PAPERS

> Downloads and processes datasets made available by the authors of academic papers.

Each `papers` module handles a different paper. Module names are built from the authors' last names and the publication year, separated by underscores. If a paper has more than two authors, all names after the first are replaced by `etal`. For example, the module for the paper “Firm-Level Political Risk: Measurement and Effects” (2019) by Tarek A. Hassan, Stephan Hollander, Laurence van Lent, and Ahmed Tahoun is named `hassan_etal_2019`.
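
The naming rule can be sketched as a small function. `paper_module_name` is a hypothetical helper written for this example, not part of `finsets`; lowercasing the names is an assumption based on the `hassan_etal_2019` module name, and the rule for multi-word last names is not covered here.

```python
def paper_module_name(last_names, year):
    """Build a module name from author last names and publication year.

    Hypothetical illustration of the naming convention; not a finsets function.
    """
    if len(last_names) > 2:
        # More than two authors: keep the first, replace the rest with 'etal'
        parts = [last_names[0].lower(), "etal"]
    else:
        parts = [name.lower() for name in last_names]
    return "_".join(parts + [str(year)])

print(paper_module_name(["Hassan", "Hollander", "van Lent", "Tahoun"], 2019))
```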

```python
papers.hassan_etal_2019.variables()[:7]
```

```
['gvkey', 'date', 'date_earningscall', 'PRisk', 'NPRisk', 'Risk', 'PSentiment']
```
