# Getting Started with PyFixest

First, we load the package and some example data:

In [20]:
import pandas as pd
from lets_plot import LetsPlot

import pyfixest as pf
from pyfixest.did.estimation import did2s
from pyfixest.did.event_study import event_study

%load_ext watermark
%watermark --iversions

pandas  : 2.2.2
pyfixest: 0.19.0b0



In [21]:
data = pf.get_data()
data.head()

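|    |         Y |        Y2 |  X1 |        X2 |   f1 |   f2 |   f3 | group_id |        Z1 |        Z2 |  weights |
|---:|----------:|----------:|----:|----------:|-----:|-----:|-----:|---------:|----------:|----------:|---------:|
|  0 |           |  2.357103 | 0.0 |  0.457858 | 15.0 |  0.0 |  7.0 |      9.0 | -0.330607 |  1.054826 | 0.661478 |
|  1 | -1.458643 |  5.163147 |     | -4.998406 |  6.0 | 21.0 |  4.0 |      8.0 |           |  -4.11369 | 0.772732 |
|  2 |  0.169132 |   0.75114 | 2.0 |   1.55848 |      |  1.0 |  7.0 |     16.0 |  1.207778 |  0.465282 | 0.990929 |
|  3 |  3.319513 | -2.656368 | 1.0 |  1.560402 |  1.0 | 10.0 | 11.0 |      3.0 |  2.869997 |   0.46757 | 0.021123 |
|  4 |   0.13442 | -1.866416 | 2.0 | -3.472232 | 19.0 | 20.0 |  6.0 |     14.0 |  0.835819 | -3.115669 | 0.790815 |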


## OLS Estimation

We can estimate a fixed effects regression via the `feols()` function. Its three main arguments are a two-sided model formula, the data, and, optionally, the type of inference.

In [22]:
fit = pf.feols(fml="Y~X1 | f1", data=data, vcov="HC1")
type(fit)

pyfixest.estimation.feols_.Feols

The first part of the formula contains the dependent variable and "regular" covariates, while the second part contains fixed effects.

`feols()` returns an instance of the `Feols` class.
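
To see what the fixed-effects part of the formula does, here is a minimal NumPy sketch. It is not PyFixest's implementation; the simulated data and the helper name `demean_by_group` are assumptions made for illustration. It checks that regressing group-demeaned `y` on group-demeaned `x` gives the same slope as a regression with one dummy variable per group.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_groups = 200, 5
group = rng.integers(0, n_groups, size=n)   # hypothetical fixed-effect ids
alpha = rng.normal(size=n_groups)           # group-specific intercepts
x = rng.normal(size=n) + alpha[group]       # regressor correlated with the group effect
y = 1.5 * x + alpha[group] + rng.normal(size=n)


def demean_by_group(v, g, k):
    """Subtract the group mean from every observation (within transformation)."""
    sums = np.bincount(g, weights=v, minlength=k)
    counts = np.bincount(g, minlength=k)
    return v - (sums / counts)[g]


# Slope from the demeaned regression (no intercept is needed after demeaning).
x_w = demean_by_group(x, group, n_groups)
y_w = demean_by_group(y, group, n_groups)
beta_within = (x_w @ y_w) / (x_w @ x_w)

# Slope from the equivalent dummy-variable regression.
dummies = (group[:, None] == np.arange(n_groups)[None, :]).astype(float)
design = np.column_stack([x, dummies])
beta_dummy = np.linalg.lstsq(design, y, rcond=None)[0][0]

print(beta_within, beta_dummy)
```

The two slopes agree up to floating-point error, which is why the fixed effects themselves never show up in the coefficient table.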

To inspect the results, we can use a summary function or method:

In [23]:
fit.summary()

###

Estimation:  OLS
Dep. var.: Y, Fixed effects: f1
Inference:  HC1
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.949 |        0.066 |   -14.311 |      0.000 | -1.080 |  -0.819 |
---
RMSE: 1.73  R2: 0.437  R2 Within: 0.161


Alternatively, the `summarize` module provides a `summary()` function, which accepts a single fitted model or a list of fitted models.

In [24]:
pf.summary(fit)

###

Estimation:  OLS
Dep. var.: Y, Fixed effects: f1
Inference:  HC1
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.949 |        0.066 |   -14.311 |      0.000 | -1.080 |  -0.819 |
---
RMSE: 1.73  R2: 0.437  R2 Within: 0.161


You can access individual elements of the summary via dedicated methods: `.tidy()` returns a "tidy" `pd.DataFrame`,
`.coef()` returns the estimated parameters, and `.se()` returns the estimated standard errors. Other methods include `.pvalue()`, `.confint()`,
and `.tstat()`.

In [25]:
fit.coef()

Coefficient
X1   -0.949441
Name: Estimate, dtype: float64

In [26]:
fit.se()

Coefficient
X1    0.066343
Name: Std. Error, dtype: float64
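
As a quick check of how these quantities relate, the sketch below recomputes the t statistic and an approximate 95% confidence interval from the coefficient and standard error printed above. The normal critical value is an assumption: the summary table may rely on a t distribution with finite degrees of freedom, so the interval bounds can differ slightly in the last digit.

```python
from statistics import NormalDist

coef = -0.949441   # value printed by fit.coef() above
se = 0.066343      # value printed by fit.se() above

# The t statistic is the estimate divided by its standard error.
t_stat = coef / se

# Normal approximation to the 95% interval (an assumption, see the note above).
crit = NormalDist().inv_cdf(0.975)
ci_lower = coef - crit * se
ci_upper = coef + crit * se

print(round(t_stat, 3), round(ci_lower, 3), round(ci_upper, 3))
```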

## Standard Errors and Inference

Supported covariance types are "iid", "HC1", "HC2", "HC3", "CRV1", and "CRV3" (with up to two-way clustering). Inference can be adjusted on the fly via the
`.vcov()` method:

In [27]:
fit.vcov({"CRV1": "group_id + f1"}).summary()
fit.vcov({"CRV3": "group_id"}).summary()

###

Estimation:  OLS
Dep. var.: Y, Fixed effects: f1
Inference:  CRV1
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.949 |        0.088 |   -10.839 |      0.000 | -1.133 |  -0.765 |
---
RMSE: 1.73  R2: 0.437  R2 Within: 0.161
###

Estimation:  OLS
Dep. var.: Y, Fixed effects: f1
Inference:  CRV3
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.949 |        0.095 |   -10.005 |      0.000 | -1.149 |  -0.750 |
---
RMSE: 1.73  R2: 0.437  R2 Within: 0.161
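
To illustrate why the choice of covariance type matters, here is a minimal NumPy sketch comparing the textbook "iid" and "HC1" formulas on simulated heteroskedastic data. The data-generating process is hypothetical, and the small-sample adjustments PyFixest applies may differ from the plain `n / (n - k)` factor used here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
# Error variance grows with |x|, so the iid assumption is violated by construction.
y = 1.0 - 1.0 * x + rng.normal(size=n) * np.abs(x)

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
bread = np.linalg.inv(X.T @ X)

# iid: sigma^2 * (X'X)^{-1}
vcov_iid = bread * (resid @ resid) / (n - k)

# HC1: (X'X)^{-1} X' diag(u^2) X (X'X)^{-1}, scaled by n / (n - k)
meat = (X * resid[:, None] ** 2).T @ X
vcov_hc1 = n / (n - k) * bread @ meat @ bread

se_iid = np.sqrt(np.diag(vcov_iid))
se_hc1 = np.sqrt(np.diag(vcov_hc1))
print(se_iid, se_hc1)
```

In this setup the robust standard error of the slope is noticeably larger than the iid one, while the point estimate is the same under both.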




It is also possible to run a wild (cluster) bootstrap after estimation (via the [wildboottest module](https://github.com/s3alfisc/wildboottest)):

In [28]:
fit2 = pf.feols(fml="Y~ X1", data=data, vcov={"CRV1": "group_id"})
fit2.wildboottest(param="X1", B=999)

param                            X1
t value           7.568059291000728
Pr(>|t|)                        0.0
bootstrap_type                   11
inference             CRV(group_id)
impose_null                    True
dtype: object

Additionally, `PyFixest` supports the causal cluster variance estimator following [Abadie et al. (2023)](https://academic.oup.com/qje/article/138/1/1/6750017). 

In [29]:
df = pd.read_stata("C:/Users/alexa/Downloads/census2000_5pc.dta")
fit3 = pf.feols("ln_earnings ~ college", vcov={"CRV1": "state"}, data=df)
fit3.ccv(treatment="college", pk=0.05, n_splits=2, seed=929)

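|      |           Estimate | Std. Error |    t value | Pr(>|t|) |     2.5% |    97.5% |
|:-----|-------------------:|-----------:|-----------:|---------:|---------:|---------:|
| CCV  | 0.4656425903701483 |    0.00348 | 133.820078 |      0.0 | 0.458657 | 0.472628 |
| CRV1 |           0.465643 |   0.027142 |  17.155606 |      0.0 | 0.411152 | 0.520133 |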


To correct for multiple testing, p-values can be adjusted via either the Bonferroni correction or the procedure of Romano and Wolf (2005).

In [30]:
pf.bonferroni([fit, fit2], param="X1").round(3)

|                     |    est0 |   est1 |
|:--------------------|--------:|-------:|
| Estimate            |  -0.949 |   -1.0 |
| Std. Error          |   0.095 |  0.117 |
| t value             | -10.005 | -8.568 |
| Pr(>|t|)            |     0.0 |    0.0 |
| 2.5%                |  -1.149 | -1.245 |
| 97.5%               |   -0.75 | -0.755 |
| Bonferroni Pr(>|t|) |     0.0 |    0.0 |
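
The Bonferroni adjustment itself is simple arithmetic: each p-value is multiplied by the number of tests and capped at 1. The sketch below shows this on made-up p-values; the helper name `bonferroni_adjust` and the inputs are hypothetical and not part of PyFixest.

```python
def bonferroni_adjust(pvalues):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


raw = [0.012, 0.030, 0.400]   # hypothetical unadjusted p-values
adjusted = bonferroni_adjust(raw)
print(adjusted)
```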


In [31]:
pf.rwolf([fit, fit2], param="X1", B=9999, seed=1234).round(3)

|             |    est0 |   est1 |
|:------------|--------:|-------:|
| Estimate    |  -0.949 |   -1.0 |
| Std. Error  |   0.095 |  0.117 |
| t value     | -10.005 | -8.568 |
| Pr(>|t|)    |     0.0 |    0.0 |
| 2.5%        |  -1.149 | -1.245 |
| 97.5%       |   -0.75 | -0.755 |
| RW Pr(>|t|) |   0.893 |    0.0 |


## IV Estimation 

It is also possible to estimate instrumental variable models with *one* endogenous variable and (potentially multiple) instruments:

In [32]:
iv_fit = pf.feols(fml="Y2~ 1 | f1 + f2 | X1 ~ Z1 + Z2", data=data)
iv_fit.summary()

###

Estimation:  IV
Dep. var.: Y2, Fixed effects: f1+f2
Inference:  CRV1
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -1.600 |        0.333 |    -4.801 |      0.000 | -2.282 |  -0.919 |
---
RMSE: nan  R2: nan  R2 Within: nan


If the model does not contain any fixed effects, just drop the second part of the formula above:

In [33]:
pf.feols(fml="Y~ 1 | X1 ~ Z1 + Z2", data=data).summary()

###

Estimation:  IV
Dep. var.: Y, Fixed effects: 
Inference:  iid
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      0.911 |        0.156 |     5.843 |      0.000 |  0.605 |   1.217 |
| X1            |     -0.993 |        0.134 |    -7.398 |      0.000 | -1.256 |  -0.730 |
---
RMSE: nan  R2: nan  R2 Within: nan


IV estimation with multiple endogenous variables and multiple estimation syntax are currently not supported. The formula syntax is `depvar ~ exog.vars | fixed effects | endog.vars ~ instruments`.
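
For readers who want to see what an IV fit computes, here is a minimal two-stage least squares sketch in NumPy without fixed effects. The simulated data and variable names are assumptions for illustration, and this is the textbook estimator rather than PyFixest's internal code.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
z1, z2 = rng.normal(size=n), rng.normal(size=n)    # instruments
u = rng.normal(size=n)                             # unobserved confounder
x = 0.8 * z1 + 0.5 * z2 + u + rng.normal(size=n)   # endogenous regressor
y = 1.0 - 1.0 * x + 2.0 * u + rng.normal(size=n)   # true effect of x is -1

const = np.ones(n)
Z = np.column_stack([const, z1, z2])   # exogenous variables and instruments
X = np.column_stack([const, x])        # exogenous variables and endogenous regressor

# First stage: project X on Z. Second stage: regress y on the fitted values.
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
beta_iv = np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Naive OLS for comparison; biased because x is correlated with u.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_iv, beta_ols)
```

The IV slope lands near the true value of -1, while the OLS slope is pulled toward zero by the confounder.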

## Poisson Regression 

Since version `0.8.4`, it is also possible to estimate Poisson regressions via `fepois()`:

In [34]:
pois_data = pf.get_data(model="Fepois")
pois_fit = pf.fepois(fml="Y~X1 | f1+f2", data=pois_data, vcov={"CRV1": "group_id"})
pois_fit.summary()

###

Estimation:  Poisson
Dep. var.: Y, Fixed effects: f1+f2
Inference:  CRV1
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.006 |        0.032 |    -0.181 |      0.856 | -0.068 |   0.057 |
---
RMSE: nan  R2: nan  R2 Within: nan  Deviance: 1070.014
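
As background, a Poisson regression without fixed effects can be fit by iteratively reweighted least squares. The NumPy sketch below shows that generic algorithm on simulated data; it is an illustration under stated assumptions and does not claim to reproduce how `fepois()` handles fixed effects or convergence.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([0.5, -0.3])          # hypothetical coefficients
y = rng.poisson(np.exp(X @ true_beta))

beta = np.zeros(2)
for _ in range(50):
    eta = X @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu                # working response
    XtW = X.T * mu                         # weights equal the fitted mean
    beta_new = np.linalg.solve(XtW @ X, XtW @ z)
    converged = np.max(np.abs(beta_new - beta)) < 1e-10
    beta = beta_new
    if converged:
        break

print(beta)
```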


## Multiple Estimation 

`PyFixest` supports a range of multiple estimation functionality: `sw`, `sw0`, `csw`, `csw0`, and multiple dependent variables. If multiple estimation syntax is used,
`feols()` and `fepois()` return an instance of a `FixestMulti` object, which essentially consists of a dictionary of `Fepois` or [Feols](/reference/Feols.qmd) instances.

In [35]:
multi_fit = pf.feols(fml="Y~X1 | csw0(f1, f2)", data=data, vcov="HC1")
multi_fit

<pyfixest.estimation.FixestMulti_.FixestMulti at 0x16c3ff88100>
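
To make the stepwise helpers concrete, the pure-Python sketch below lists the term sets each helper stands for, assuming the usual `fixest` semantics: `sw` tries terms one at a time, `csw` adds them cumulatively, and the `0` variants start with no term at all. The function `expand_stepwise` is hypothetical and not part of PyFixest.

```python
def expand_stepwise(kind, terms):
    """Return the list of term sets that a stepwise helper stands for."""
    if kind not in {"sw", "sw0", "csw", "csw0"}:
        raise ValueError(f"unknown stepwise helper: {kind}")
    cumulative = kind.startswith("csw")
    specs = [[]] if kind.endswith("0") else []   # the "0" variants start with no term
    for i, term in enumerate(terms):
        specs.append(list(terms[: i + 1]) if cumulative else [term])
    return specs


for kind in ("sw", "sw0", "csw", "csw0"):
    print(kind, ["+".join(s) for s in expand_stepwise(kind, ["f1", "f2"])])
```

For `csw0(f1, f2)` this yields three specifications: no fixed effects, `f1`, and `f1+f2`, which matches the three models in `multi_fit`.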

In [36]:
multi_fit.summary()

###

Estimation:  OLS
Dep. var.: Y, Fixed effects: 
Inference:  HC1
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      0.919 |        0.112 |     8.223 |      0.000 |  0.699 |   1.138 |
| X1            |     -1.000 |        0.082 |   -12.134 |      0.000 | -1.162 |  -0.838 |
---
RMSE: 2.158  R2: 0.123  R2 Within: nan
###

Estimation:  OLS
Dep. var.: Y, Fixed effects: f1
Inference:  HC1
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.949 |        0.066 |   -14.311 |      0.000 | -1.080 |  -0.819 |
---
RMSE: 1.73  R2: 0.437  R2 Within: 0.161
###

Estimation:  OLS
Dep. var.: Y, Fixed effects: f1+f2
Inference:  HC1
Observations:  997

| Coefficient   |   

Alternatively, you can look at the estimation results via the `etable()` method:

In [37]:
multi_fit.etable()

                           est1               est2               est3
------------  -----------------  -----------------  -----------------
depvar                        Y                  Y                  Y
---------------------------------------------------------------------
Intercept      0.919*** (0.112)
X1            -1.000*** (0.082)  -0.949*** (0.066)  -0.919*** (0.058)
---------------------------------------------------------------------
f1                            -                  x                  x
f2                            -                  -                  x
                              x                  -                  -
---------------------------------------------------------------------
R2                        0.123              0.437              0.609
S.E. type                hetero             hetero             hetero
Observations                998                997                997
----------------------------------------------------------

You can access an individual model by its name, i.e. its formula, via the `all_fitted_models` attribute.

In [39]:
multi_fit.all_fitted_models["Y~X1"].tidy()

| Coefficient |  Estimate | Std. Error |    t value |     Pr(>|t|) |      2.5% |    97.5% |
|:------------|----------:|-----------:|-----------:|-------------:|----------:|---------:|
| Intercept   |  0.918518 |   0.111707 |    8.22258 | 6.661338e-16 |   0.69931 | 1.137725 |
| X1          | -1.000086 |    0.08242 | -12.134086 |          0.0 | -1.161822 | -0.83835 |


or equivalently via the `fetch_model` method:

In [40]:
multi_fit.fetch_model(0).tidy()

Model:  Y~X1


| Coefficient |  Estimate | Std. Error |    t value |     Pr(>|t|) |      2.5% |    97.5% |
|:------------|----------:|-----------:|-----------:|-------------:|----------:|---------:|
| Intercept   |  0.918518 |   0.111707 |    8.22258 | 6.661338e-16 |   0.69931 | 1.137725 |
| X1          | -1.000086 |    0.08242 | -12.134086 |          0.0 | -1.161822 | -0.83835 |


Here, `0` simply fetches the first model stored in the `all_fitted_models` dictionary, `1` the second etc.

Objects of type `FixestMulti` come with a range of additional methods, such as `tidy()`, `coef()`, and `vcov()`, which
essentially loop over the equivalent methods of all fitted models. For example, `FixestMulti.vcov()` updates inference for all
models stored in the `FixestMulti` object.

In [41]:
multi_fit.vcov("iid").summary()

###

Estimation:  OLS
Dep. var.: Y, Fixed effects: 
Inference:  iid
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      0.919 |        0.112 |     8.214 |      0.000 |  0.699 |   1.138 |
| X1            |     -1.000 |        0.085 |   -11.802 |      0.000 | -1.166 |  -0.834 |
---
RMSE: 2.158  R2: 0.123  R2 Within: nan
###

Estimation:  OLS
Dep. var.: Y, Fixed effects: f1
Inference:  iid
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.949 |        0.069 |   -13.846 |      0.000 | -1.084 |  -0.815 |
---
RMSE: 1.73  R2: 0.437  R2 Within: 0.161
###

Estimation:  OLS
Dep. var.: Y, Fixed effects: f1+f2
Inference:  iid
Observations:  997

| Coefficient   |   

If you have estimated multiple models without multiple estimation syntax and still want to compare them, you can use the `etable()` function: 

In [42]:
pf.etable([fit, fit2])

                           est1               est2
------------  -----------------  -----------------
depvar                        Y                  Y
--------------------------------------------------
X1            -0.949*** (0.095)  -1.000*** (0.117)
Intercept                         0.919*** (0.121)
--------------------------------------------------
f1                            x                  -
                              -                  x
--------------------------------------------------
R2                        0.437              0.123
S.E. type          by: group_id       by: group_id
Observations                997                998
--------------------------------------------------
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Format of coefficient cell:
Coefficient (Std. Error)


## Visualization 

`PyFixest` provides two functions to visualize the results of a regression: `coefplot` and `iplot`.

In [43]:
LetsPlot.setup_html()

multi_fit.coefplot().show()

## Difference-in-Differences / Event Study Designs

`PyFixest` supports event study designs via two-way fixed effects and Gardner's two-stage estimator.

In [44]:
url = "https://raw.githubusercontent.com/s3alfisc/pyfixest/master/pyfixest/did/data/df_het.csv"
df_het = pd.read_csv(url)
df_het.head()

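|    | unit | state | group   |  unit_fe |    g | year |   year_fe | treat | rel_year | rel_year_binned |     error | te | te_dynamic |  dep_var |
|---:|-----:|------:|:--------|---------:|-----:|-----:|----------:|:------|---------:|----------------:|----------:|---:|-----------:|---------:|
|  0 |    1 |    33 | Group 2 | 7.043016 | 2010 | 1990 |  0.066159 | False |    -20.0 |              -6 | -0.086466 |  0 |        0.0 | 7.022709 |
|  1 |    1 |    33 | Group 2 | 7.043016 | 2010 | 1991 |  -0.03098 | False |    -19.0 |              -6 |  0.766593 |  0 |        0.0 | 7.778628 |
|  2 |    1 |    33 | Group 2 | 7.043016 | 2010 | 1992 | -0.119607 | False |    -18.0 |              -6 |  1.512968 |  0 |        0.0 | 8.436377 |
|  3 |    1 |    33 | Group 2 | 7.043016 | 2010 | 1993 |  0.126321 | False |    -17.0 |              -6 |   0.02187 |  0 |        0.0 | 7.191207 |
|  4 |    1 |    33 | Group 2 | 7.043016 | 2010 | 1994 | -0.106921 | False |    -16.0 |              -6 | -0.017603 |  0 |        0.0 | 6.918492 |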


In [45]:
fit_did2s = did2s(
    df_het,
    yname="dep_var",
    first_stage="~ 0 | state + year",
    second_stage="~i(rel_year,ref= -1.0)",
    treatment="treat",
    cluster="state",
)

fit_twfe = pf.feols(
    "dep_var ~ i(rel_year,ref = -1.0) | state + year",
    df_het,
    vcov={"CRV1": "state"},
)

pf.iplot(
    [fit_did2s, fit_twfe], coord_flip=False, figsize=(900, 400), title="TWFE vs DID2S"
)
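
The `i(rel_year, ref=-1.0)` term used in both formulas expands `rel_year` into one indicator column per level and leaves out the reference period, so each coefficient is measured relative to the period just before treatment. The NumPy sketch below mimics that expansion on a few made-up values; it is an illustration of the idea, not PyFixest's formula parser.

```python
import numpy as np

rel_year = np.array([-2.0, -1.0, 0.0, 1.0, -1.0, 2.0])   # hypothetical relative periods
ref = -1.0

levels = np.unique(rel_year)
kept = levels[levels != ref]                  # every level except the reference
dummies = (rel_year[:, None] == kept[None, :]).astype(float)

print(kept)
print(dummies)
```

Rows that belong to the reference period are all zeros, which is what pins the plotted coefficient at `rel_year = -1` to zero.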

The `event_study()` function provides a common API for several event study estimators.

In [46]:
fit_twfe = event_study(
    data=df_het,
    yname="dep_var",
    idname="state",
    tname="year",
    gname="g",
    estimator="twfe",
)

fit_did2s = event_study(
    data=df_het,
    yname="dep_var",
    idname="state",
    tname="year",
    gname="g",
    estimator="did2s",
)

pf.etable([fit_twfe, fit_did2s])

                          est1              est2
------------  ----------------  ----------------
depvar                 dep_var       dep_var_hat
------------------------------------------------
ATT           2.135*** (0.044)  2.152*** (0.048)
------------------------------------------------
year                         x                 -
state                        x                 -
                             -                 x
------------------------------------------------
R2                       0.758             0.338
S.E. type            by: state              CRV1
Observations             46500             46500
------------------------------------------------
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Format of coefficient cell:
Coefficient (Std. Error)
