# Data and Empirical Strategy

## Data Sources and Sample Construction
To construct a continuous panel of bank balance sheets spanning 1984 to 2019, we had to reconcile a major structural break in Federal Reserve reporting formats. Prior to 2001, FFIEC Call Report data were distributed as SAS Transport (XPT) files with non-standard date indexing and cryptic variable codes. Thereafter, the format shifted to flat text/CSV files with standardized naming conventions. We developed a dual-pipeline ingestion strategy to harmonize the two eras: a custom Python parser processed the legacy XPT files, recovering reporting dates from filename metadata (e.g., `call8403.xpt`), while a separate recursive scraper aggregated the modern directory structures. Both streams were mapped to a unified schema (matching `RCFD` and `RCON` prefixes) and concatenated to form a seamless 140-quarter time series. See the Appendix for technical details and replication code.
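
The date-recovery step can be illustrated with a minimal sketch. It assumes the legacy filenames follow a `callYYMM.xpt` pattern (inferred from the single example `call8403.xpt`) and that two-digit years of 84 and above belong to the 1900s. The helper name `parse_report_date` and the pivot rule are hypothetical and are not taken from the replication code.

```python
import calendar
import re
from datetime import date

# Assumed pattern: "call" + two-digit year + two-digit month + ".xpt".
_LEGACY_NAME = re.compile(r"^call(\d{2})(\d{2})\.xpt$", re.IGNORECASE)


def parse_report_date(filename: str) -> date:
    """Return the quarter-end reporting date encoded in a legacy filename."""
    match = _LEGACY_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized legacy filename: {filename!r}")
    yy, month = int(match.group(1)), int(match.group(2))
    if month not in (3, 6, 9, 12):
        raise ValueError(f"not a quarter-end month: {month}")
    # Assumed pivot: 84-99 map to 1984-1999, 00-83 map to 2000-2083.
    year = 1900 + yy if yy >= 84 else 2000 + yy
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day)
```

Under these assumptions, `parse_report_date("call8403.xpt")` yields the first-quarter 1984 reporting date, which can then be attached to every row read from that file.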

![Data transformation pipeline](graphs/datatransformation.png)

We applied a strict continuity filter to ensure panel stability. We excluded entities with fewer than 8 quarters of continuous data, both to mitigate noise from short-lived institutions and to ensure enough observations to construct the lagged variables used in the dynamic specification. The final dataset contains approximately 740,000 bank-quarter observations.
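
One way to implement such a filter is sketched below. It interprets "continuous" as the longest run of consecutive quarters observed for a bank, which is an assumption about the exact rule. The row layout `(bank_id, year, quarter, ...)` and the function names are hypothetical.

```python
from collections import defaultdict


def quarter_index(year: int, quarter: int) -> int:
    """Map (year, quarter) to an integer so consecutive quarters differ by 1."""
    return year * 4 + (quarter - 1)


def longest_run(indices) -> int:
    """Length of the longest run of consecutive integers in `indices`."""
    best = run = 0
    prev = None
    for idx in sorted(set(indices)):
        run = run + 1 if prev is not None and idx == prev + 1 else 1
        best = max(best, run)
        prev = idx
    return best


def filter_panel(rows: list, min_quarters: int = 8) -> list:
    """Keep rows of banks with at least `min_quarters` consecutive quarters.

    Each row is assumed to start with (bank_id, year, quarter).
    """
    by_bank = defaultdict(list)
    for row in rows:
        by_bank[row[0]].append(quarter_index(row[1], row[2]))
    keep = {bank for bank, idx in by_bank.items()
            if longest_run(idx) >= min_quarters}
    return [row for row in rows if row[0] in keep]
```

A bank with eight observations interrupted by a gap is dropped under this rule, because its longest uninterrupted run is shorter than eight quarters.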


To measure the stance of monetary policy communication, we construct a sentiment index from the full corpus of Federal Open Market Committee (FOMC) transcripts. Using a custom web scraper, we retrieved the transcript of every FOMC meeting from 1976 to 2019 directly from the **Federal Reserve Board of Governors'** historical archives. Unlike the concise "Post-Meeting Statements," transcripts provide a verbatim record of the deliberations and therefore a richer text source for sentiment extraction.




## Sentiment Analysis

We employ a dictionary-based textual analysis using the Loughran and McDonald (2011) financial sentiment lexicon, implemented via the `pysentiment2` Python library. For each meeting $t$, we tokenize the full transcript and calculate a "Net Tone" score from the number of positive and negative financial terms it contains, denoted $Positive_t$ and $Negative_t$. The score is defined as:

$$
NetSentiment_t = \frac{Positive_{t} - Negative_{t}}{Positive_{t} + Negative_{t}}
$$
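
The formula can be illustrated with a small, self-contained sketch. The two word sets below are toy placeholders, not the Loughran-McDonald lexicon, and the tokenizer is a simple regular expression rather than the one used by `pysentiment2`. Returning zero when no lexicon term appears is also an assumption.

```python
import re

# Toy word lists for illustration only; the real lexicon is far larger.
POSITIVE = {"strong", "improve", "gain"}
NEGATIVE = {"weak", "decline", "loss"}


def net_sentiment(text: str, positive=POSITIVE, negative=NEGATIVE) -> float:
    """Compute (Positive - Negative) / (Positive + Negative) for one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(token in positive for token in tokens)
    neg = sum(token in negative for token in tokens)
    if pos + neg == 0:
        # Assumed convention: a document with no lexicon hits is neutral.
        return 0.0
    return (pos - neg) / (pos + neg)
```

The score is bounded between -1 (only negative terms) and 1 (only positive terms), and it does not depend on transcript length because both counts enter the numerator and denominator.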

To isolate the "Pure Sentiment" channel, we must distinguish between communication and policy action. We retrieve the Daily Effective Federal Funds Rate (Series: DFF) directly from the Federal Reserve Bank of St. Louis (FRED) API. We then orthogonalize the sentiment score against the effective rate using the regression described below. The resulting residual represents the "Sentiment Shock": variation in Fed communication that cannot be explained by the current interest rate level. Finally, we align these shocks with bank reporting periods using the "Late-Quarter Shift" protocol (see Appendix B) to ensure banks had access to the information prior to filing their Call Reports.
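
The orthogonalization step amounts to keeping the residuals from a regression of sentiment on the policy rate. The sketch below assumes a univariate OLS with an intercept and a rate series already matched to meeting dates. The actual specification may include further terms, and the function name `ols_residuals` is hypothetical.

```python
def ols_residuals(y: list, x: list) -> list:
    """Residuals from an OLS regression of y on a constant and x."""
    n = len(y)
    if n != len(x) or n < 2:
        raise ValueError("y and x must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The residual is the part of y not explained by the level of x.
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
```

By construction the residuals have zero mean and are uncorrelated with the rate in sample, which is the sense in which the shock is "purged" of the interest rate level.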

# Econometric Strategy

## Model Specification
To identify the causal effect of sentiment on risk-taking, we estimate a dynamic panel model with bank fixed effects. Dynamic panels with fixed effects can suffer from Nickell bias (Nickell, 1981) when the time dimension is short, but our sample covers a long time dimension ($T = 92$ quarters). Because the bias is of order $O(1/T)$, it is negligible in our context (Judson & Owen, 1999). We therefore employ a standard Fixed Effects (Within) estimator rather than Difference-GMM or Anderson-Hsiao, which allows for more efficient use of the data.
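
To give a sense of magnitude, Nickell's (1981) large-$T$ approximation for the inconsistency of the within estimator can be evaluated at the panel length used here. The value $\rho = 0.9$ is purely illustrative (an assumed degree of persistence, not an estimate from this paper):

$$
\text{plim}_{N \to \infty} (\hat{\rho} - \rho) \approx -\frac{1 + \rho}{T - 1} = -\frac{1 + 0.9}{92 - 1} \approx -0.021
$$

Under that assumption, the autoregressive coefficient would be understated by roughly two hundredths.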

The primary specification is as follows:

$$
Risk_{i,t} = \alpha_i + \rho Risk_{i,t-1} + \beta_1 Shock_{t-1} + \beta_2 (Shock_{t-1} \times Size_{i,t-1}) + \beta_3 Controls_{i,t-1} + \epsilon_{i,t}
$$

Where:
* $Risk_{i,t}$ is the Loans-to-Assets ratio of bank $i$ in quarter $t$; $\rho$ captures its persistence.
* $\alpha_i$ represents bank-specific fixed effects (time-invariant heterogeneity).
* $Shock_{t-1}$ is the "Purged" Sentiment Shock, standardized to unit variance.
* $Size_{i,t-1}$ is the natural log of total assets, centered at the sample mean to facilitate interpretation of the main effect.
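
The variable transformations listed above can be sketched as follows. This is a minimal illustration that assumes the shock is demeaned as well as scaled, that the sample standard deviation is used, and that the shock and size lists passed to `interaction` are already aligned row by row on the bank-quarter panel. All function names are hypothetical.

```python
import math
import statistics


def standardize(values: list) -> list:
    """Demean a series and scale it to unit sample variance."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def center_log_assets(total_assets: list) -> list:
    """Natural log of total assets, centered at the sample mean of the logs."""
    logs = [math.log(a) for a in total_assets]
    mean_log = statistics.fmean(logs)
    return [v - mean_log for v in logs]


def interaction(shock: list, size: list) -> list:
    """Row-wise product of the standardized shock and centered size."""
    return [s * z for s, z in zip(shock, size)]
```

With size centered, $\beta_1$ reads as the effect of a one-standard-deviation shock on a bank of average log size, and $\beta_2$ measures how that effect changes as log assets move away from the mean.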

We compute standard errors clustered by both **Entity (Bank)** and **Time (Date)** to account for serial correlation within banks and common shocks affecting all banks simultaneously.

Regression code can be found in Appendix C.

![risk taking pre post 2008](graphs/prepost2008.png)
