Commit: Initial documentation

mfarragher committed Nov 10, 2019
1 parent bcfd970 commit 2f900f3
Showing 8 changed files with 352 additions and 0 deletions.
Binary file added docs/img/2x2-diagnostics-ols.png
58 changes: 58 additions & 0 deletions docs/index.md
@@ -0,0 +1,58 @@
<header>
<p style="font-size:28px;"><b>appelpy: Applied Econometrics Library for Python</b></p>
</header>

**appelpy** is the *Applied Econometrics Library for Python*. It seeks to bridge the gap between the software options that have a simple syntax (such as Stata) and other powerful options that use Python's object-oriented programming as part of data modelling workflows. ⚗️

Econometric modelling and general regression analysis in Python have never been easier!

The library builds upon the functionality of the 'vanilla' Python data stack (e.g. pandas, NumPy) and other libraries such as Statsmodels.

## 10 Minutes to Appelpy
Explore the core functionality of Appelpy in the **10 Minutes To Appelpy** notebook (click the badges):

- [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/mfarragher/appelpy-examples/master?filepath=00_ten-minutes-to-appelpy.ipynb): interactive experience of the *10 Minutes to Appelpy* tutorial via Binder.
- [![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/00_ten-minutes-to-appelpy.ipynb): static render of the *10 Minutes to Appelpy* notebook.

# Installation
Install the library via the Pip command:
``` bash
pip install appelpy
```

Python 3.6 and higher versions are supported.

# Basic usage
It only takes one line of code to fit a basic linear model of 'y on X' and another line to return the model's results.

```python
from appelpy.linear_model import OLS

model1 = OLS(df, y_list, X_list).fit() # y_list & X_list contain df columns
model1.results_output # returns summary results
```

Model objects have many useful attributes, e.g. the inputs X & y, standardized values of X and y, and the results of fitted models (incl. standardized estimates). The library also has diagnostic classes and functions that consume model objects (or their underlying data).

Here are more things that can be obtained via one line of code:

* *Diagnostics* can be called from the object: e.g. produce a P-P plot via `model1.diagnostic_plot('pp_plot')`
* *Model selection statistics*: e.g. find the root mean square error of the model from `model1.model_selection_stats`
* *Standardized model estimates*: `model1.results_output_standardized`

Classes in the library have a fluent interface, so that they can be instantiated and have chained methods in one line of code.

# Modules
## Exploration and pre-processing
- **`eda`:** functions for exploratory data analysis (EDA) of datasets, e.g. `statistical_moments` for obtaining mean, variance, skewness and kurtosis of all numeric columns.
- **`utils`:** classes and functions for data pre-processing, e.g. encoding of interaction effects and dummy variables in datasets.
- `DummyEncoder`: encode dummy variables in a dataset based on different policies for dealing with NaN values.
- `InteractionEncoder`: encode interaction effects of variables in a dataset.
## Model fitting
- **`linear_model`:** classes for linear models such as Ordinary Least Squares (OLS) and Weighted Least Squares (WLS).
- **`discrete_model`:** classes for discrete choice models, e.g. logistic regression (Logit).
## Model diagnostics
- **`diagnostics`:**
- `BadApples`: class for inspecting observations that could 'stink up' a model, i.e. the observations that are outliers, high-leverage points or else have high influence in a model.
- `variance_inflation_factors`: function that returns variance inflation factor (VIF) scores for regressors in a dataset.
- `partial_regression_plot`: also known as 'added variable plot'. Examine the effect of adding a regressor to a model.
90 changes: 90 additions & 0 deletions docs/reference/diagnostics.md
@@ -0,0 +1,90 @@
<header>
<pre><p style="font-size:28px;"><b>diagnostics</b></p></pre>
</header>

# Overview
The **`diagnostics`** module has classes and functions to examine the fit of OLS models and the extreme observations in datasets.

The main class is the `BadApples` class, which consumes an OLS model object and is used to examine the outliers, high-leverage points and influential points in a model. In essence it is used to examine the 'bad apples' that may be stinking up a model's results.

The main functions are:

- `variance_inflation_factors`
- `heteroskedasticity_test`
- `partial_regression_plot`

There are also functions for diagnostic plots such as `pp_plot`, but these are exposed more conveniently through the `diagnostic_plot` method of an OLS model object.

## `BadApples`
The **10 Minutes To Appelpy** notebook fits a **BadApples** instance, consuming a model of the California Test Score dataset.

- [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/mfarragher/appelpy-examples/master?filepath=00_ten-minutes-to-appelpy.ipynb): interactive experience of the *10 Minutes to Appelpy* tutorial via Binder.
- [![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/00_ten-minutes-to-appelpy.ipynb): static render of the *10 Minutes to Appelpy* notebook.

```python
from appelpy.diagnostics import BadApples
bad_apples = BadApples(model_hc1).fit()
```

### Attributes
- Measures: `measures_influence`, `measures_leverage` and `measures_outliers`.
- Indices: `indices_high_influence`, `indices_high_leverage` and `indices_outliers`.

INFLUENCE:

- `dfbeta` (one column per independent variable): DFBETA diagnostic. Extreme if val > 2 / sqrt(n).
- `dffits`: DFFITS diagnostic. Extreme if val > 2 * sqrt(k / n).
- `cooks_d`: Cook's distance. Extreme if val > 4 / n.

LEVERAGE:

- `leverage`: value from the hat matrix diagonal. Extreme if val > (2*k + 2) / n.

OUTLIERS:

- `resid_standardized`: standardized residual. Extreme if |val| > 2, i.e. approx. 5% of observations will be flagged.
- `resid_studentized`: studentized residual. Extreme if |val| > 2, i.e. approx. 5% of observations will be flagged.
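
The cutoffs above depend only on the number of observations (n) and the number of regressors (k). Here is a minimal helper that computes them, as a sketch of the rule-of-thumb conventions listed (not appelpy's internal code):

```python
from math import sqrt

def influence_thresholds(n, k):
    """Rule-of-thumb cutoffs for flagging extreme observations,
    given n observations and k regressors."""
    return {
        'dfbeta': 2 / sqrt(n),        # per-regressor DFBETA cutoff
        'dffits': 2 * sqrt(k / n),    # DFFITS cutoff
        'cooks_d': 4 / n,             # Cook's distance cutoff
        'leverage': (2 * k + 2) / n,  # hat-value cutoff
        'resid': 2,                   # |standardized/studentized residual|
    }
```

For example, with n=420 and k=2 the Cook's distance cutoff is roughly 0.0095, so only a handful of observations would normally be flagged.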

### Methods
The `plot_leverage_vs_residuals_squared` method plots leverage values (y-axis) against the residuals squared (x-axis). The plot can be annotated with the index values.

## Variance inflation factors
The `variance_inflation_factors` function takes a dataframe and calculates the variance inflation factor (VIF) for each of its regressors.
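
A VIF is computed with the standard formula 1 / (1 - R^2), where R^2 comes from regressing one column on all of the others. Here is a self-contained sketch of that computation (illustrative only, not appelpy's own implementation):

```python
import numpy as np

def vif_scores(X):
    """Variance inflation factors from first principles: regress each
    column on the remaining columns (plus a constant) and take
    1 / (1 - R^2). Illustrative sketch, not appelpy's implementation."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r_squared = 1 - resid.var() / y.var()
        vifs.append(1 / (1 - r_squared))
    return vifs
```

A score far above 1 signals that a regressor is close to a linear combination of the others (a common rule of thumb flags VIF > 10).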

## Heteroskedasticity test
The `heteroskedasticity_test` function takes an OLS model object and returns the results of a heteroskedasticity test (the test statistic and p-value). Examples of heteroskedasticity tests include:

- Breusch-Pagan test (`breusch_pagan`)
- Breusch-Pagan studentized test (`breusch_pagan_studentized`)
- White test (`white`)

The **10 Minutes To Appelpy** notebook shows the results of heteroskedasticity tests, given a model fitted to the California Test Score dataset.

- [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/mfarragher/appelpy-examples/master?filepath=00_ten-minutes-to-appelpy.ipynb): interactive experience of the *10 Minutes to Appelpy* tutorial via Binder.
- [![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/00_ten-minutes-to-appelpy.ipynb): static render of the *10 Minutes to Appelpy* notebook.

Here is a code snippet for a heteroskedasticity test.
```python
from appelpy.diagnostics import heteroskedasticity_test

ep, pval = heteroskedasticity_test('breusch_pagan_studentized', model_nonrobust)
print('Breusch-Pagan test (studentized)')
print('Test statistic: {:.4f}'.format(ep))
print('Test p-value: {:.4f}'.format(pval))
```

## Partial regression plot
Also known as the added variable plot, the partial regression plot shows the effect of adding another regressor (independent variable) to a regression model.

The method requires these parameters:

- `appelpy_model_object`: a fitted OLS model object.
- `df`: the dataframe used in the model.
- `regressor`: the additional variable in the partial regression.
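
The plot's points are residuals against residuals: the vertical axis holds the residuals of y regressed on the existing regressors, and the horizontal axis the residuals of the candidate regressor on that same set. By the Frisch-Waugh-Lovell theorem, the slope through these points equals the coefficient the candidate would get in the full model. A sketch of the computation (illustrative, not appelpy's implementation):

```python
import numpy as np

def partial_regression_points(y, X, z):
    """Points of an added-variable plot: residuals of y on X (vertical)
    against residuals of candidate regressor z on X (horizontal)."""
    n = len(y)
    Xc = np.column_stack([np.ones(n), X])  # add a constant term

    def resid(v):
        beta, *_ = np.linalg.lstsq(Xc, np.asarray(v, float), rcond=None)
        return np.asarray(v, float) - Xc @ beta

    return resid(z), resid(y)
```

Scattering the two returned series (e.g. with Matplotlib) reproduces the added variable plot for `z`.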
46 changes: 46 additions & 0 deletions docs/reference/discrete-model.md
@@ -0,0 +1,46 @@
<header>
<pre><p style="font-size:28px;"><b>discrete_model</b></p></pre>
</header>

# Overview
These are the classes for discrete choice models:

- Logistic regression (`Logit`)

The classes are built upon Statsmodels.

# Fit a model
[![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/02-01_logistic-regression_glm-logit.ipynb): static render of the notebook that fits a Logit regression. Predictions are also made using the original data to show the estimated probabilities of the positive class.

```python
from appelpy.discrete_model import Logit
model1 = Logit(df, y_list, X_list).fit()
model1.results_output # returns summary results
```

The **`fit` method must be called** in order to set attributes for the model object.

There are three **important parameters** for initialising any model class in Appelpy:

- `df`: the dataframe to use for modelling. This must have no NaN values, no infinite values, etc.
- `y_list`: a list with the dependent variable column name.
- `regressors_list`: a list of column names for independent variables (regressors).

## Attributes
Here are some attributes available for discrete choice models:

- `y` and `X`: the dataframes of the dependent and independent variables.
- `y_standardized` and `X_standardized`: the standardized versions of `y` and `X`.
- `results_output` for the Statsmodels summary of the model. Note: the Statsmodels results object is also stored in the `results` attribute.
- `results_output_standardized` for the standardized estimates of the model.
- `model_selection_stats`: dictionary of key statistics on the model fit.
- The model residuals `resid` and their standardized form `resid_standardized`.

Logit has the `odds_ratio` attribute.
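
An odds ratio is the exponential of a logit coefficient: a one-unit increase in the regressor multiplies the odds of the positive class by exp(beta). A minimal illustration of the transformation (`odds_ratio` presumably applies it to every fitted coefficient):

```python
from math import exp

# A coefficient of ~0.6931 (i.e. ln 2) means each one-unit increase
# in the regressor doubles the odds of the positive class.
beta = 0.693147
odds_ratio = exp(beta)  # ~2.0
```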

## Methods
For all model classes there is a `significant_regressors` method that returns a list of the significant independent variables of a model, given a significance level *alpha*.

Use `fit` to fit a model.

Pass a NumPy array to a `predict` call in order to make predictions given a model. The method checks whether the regressor values passed to it are 'within sample' before returning predictions. By default, predictions are only returned if all the regressor values are 'within sample'.
20 changes: 20 additions & 0 deletions docs/reference/eda.md
@@ -0,0 +1,20 @@
<header>
<pre><p style="font-size:28px;"><b>eda</b></p></pre>
</header>

# Overview
The **`eda`** module has functions to support exploratory data analysis.

## Statistical moments
The **10 Minutes To Appelpy** notebook shows the statistical moments of the California Test Score dataset.

- [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/mfarragher/appelpy-examples/master?filepath=00_ten-minutes-to-appelpy.ipynb): interactive experience of the *10 Minutes to Appelpy* tutorial via Binder.
- [![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/00_ten-minutes-to-appelpy.ipynb): static render of the *10 Minutes to Appelpy* notebook.

```python
from appelpy.eda import statistical_moments
statistical_moments(df)
```

## Correlation heatmap
The `correlation_heatmap` function produces a heatmap (triangular form) of the correlation matrix, given a dataset `df`.
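
The triangular form shows each pairwise correlation exactly once by masking the redundant upper triangle. A sketch of the matrix that the heatmap draws (illustrative only; `correlation_heatmap` renders the actual plot):

```python
import numpy as np

def lower_triangle_corr(data):
    """Correlation matrix with the upper triangle masked out (NaN),
    i.e. the triangular form a heatmap would display."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
    return np.where(mask, np.nan, corr)
```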
72 changes: 72 additions & 0 deletions docs/reference/linear-model.md
@@ -0,0 +1,72 @@
<header>
<pre><p style="font-size:28px;"><b>linear_model</b></p></pre>
</header>

# Overview
There are two classes for linear models:

- Weighted Least Squares (WLS)
- Ordinary Least Squares (OLS)

**OLS** is the core model class in Appelpy. Many of the model diagnostics available are built for OLS models.

The classes are built upon Statsmodels. Features that are available in Appelpy include standardized estimates for models.

# Fit a model
The **10 Minutes To Appelpy** notebook fits an **OLS** model to the California Test Score dataset.

- [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/mfarragher/appelpy-examples/master?filepath=00_ten-minutes-to-appelpy.ipynb): interactive experience of the *10 Minutes to Appelpy* tutorial via Binder.
- [![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/00_ten-minutes-to-appelpy.ipynb): static render of the *10 Minutes to Appelpy* notebook.

```python
from appelpy.linear_model import OLS
model1 = OLS(df, y_list, X_list).fit()
model1.results_output # returns summary results
```

The **`fit` method must be called** in order to set attributes for the model object.

There are three **important parameters** for initialising any model class in Appelpy:

- `df`: the dataframe to use for modelling. This must have no NaN values, no infinite values, etc.
- `y_list`: a list with the dependent variable column name.
- `regressors_list`: a list of column names for independent variables (regressors).

Other keyword arguments can be used when initialising the model class, e.g. `cov_type` to specify the type of standard errors in the model.

## Attributes
Here are some attributes available for linear models:

- `y` and `X`: the dataframes of the dependent and independent variables.
- `y_standardized` and `X_standardized`: the standardized versions of `y` and `X`.
- `results_output` for the Statsmodels summary of the model. Note: the Statsmodels results object is also stored in the `results` attribute.
- `results_output_standardized` for the standardized estimates of the model.
- `model_selection_stats`: dictionary of key statistics on the model fit, including the root mean square error (root MSE).
- The model residuals `resid` and their standardized form `resid_standardized`.
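
A standardized estimate rescales a raw coefficient into standard deviations of y per standard deviation of x, which makes effect sizes comparable across regressors. A minimal sketch of the usual conversion (what `results_output_standardized` reports per regressor, though not appelpy's exact code):

```python
import numpy as np

def standardized_coef(beta, x, y):
    """Raw OLS coefficient -> standardized coefficient:
    the change in y (in std devs) per one-std-dev change in x."""
    return beta * np.std(x, ddof=1) / np.std(y, ddof=1)
```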

## Methods
For all linear model classes there is a `significant_regressors` method that returns a list of the significant independent variables of a model, given a significance level *alpha*.

Use `fit` to fit a model.

Pass a NumPy array to a `predict` call in order to make predictions given a model. The method checks whether the regressor values passed to it are 'within sample' before returning predictions. By default, predictions are only returned if all the regressor values are 'within sample'.
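
A minimal sketch of that kind of 'within sample' guard, assuming it checks each new value against the min-max range of the corresponding training column:

```python
import numpy as np

def within_sample(X_train, x_new):
    """True if every value in x_new lies inside the [min, max] range
    of the corresponding training column (hypothetical helper; the
    real check lives inside the model's predict method)."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    return bool(np.all((x_new >= X_train.min(axis=0)) &
                       (x_new <= X_train.max(axis=0))))
```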

### Diagnostic plot method
There is also a convenient `diagnostic_plot` method on OLS model objects, which returns plots such as:

- P-P plot (`pp_plot`)
- Q-Q plot (`qq_plot`)
- Residuals vs fitted values plot (`rvf_plot`)
- Residuals vs predicted values plot (`rvp_plot`)

Here is a useful recipe for producing a 2x2 grid of diagnostic plots with Matplotlib:
```python
import matplotlib.pyplot as plt

fig, axarr = plt.subplots(2, 2, figsize=(10, 10))
model_nonrobust.diagnostic_plot('pp_plot', ax=axarr[0][0])
model_nonrobust.diagnostic_plot('qq_plot', ax=axarr[0][1])
model_nonrobust.diagnostic_plot('rvp_plot', ax=axarr[1][0])
model_nonrobust.diagnostic_plot('rvf_plot', ax=axarr[1][1])
plt.tight_layout()
```

![OLS diagnostics plot](/img/2x2-diagnostics-ols.png)
49 changes: 49 additions & 0 deletions docs/reference/utils.md
@@ -0,0 +1,49 @@
<header>
<pre><p style="font-size:28px;"><b>utils</b></p></pre>
</header>

# Overview
The **`utils`** module has classes and functions to support pre-processing of datasets before modelling.

The main classes are:

- `DummyEncoder`
- `InteractionEncoder`

## `DummyEncoder`
[![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/01-01_dummy-and-interaction-encoders_hsbdemo.ipynb): static render of a notebook that has an example of dummy variable encoding.

Suppose `schtyp`, `prog` and `honors` are categorical variables in a dataframe `df_raw` and we want to make dummy variables for them in a new transformed dataset. A new dataframe `df` is returned by calling the `transform` method on an instance of `DummyEncoder`:
```python
from appelpy.utils import DummyEncoder
df = (DummyEncoder(df_raw, {'schtyp': None,
'prog': None,
'honors': None})
.transform())
```

The class is initialised with:

- A raw dataframe
- A dictionary of key-value pairs, where each pair has a categorical variable as a key and a base level as a value.
- A 'NaN policy' to determine how any NaN values in the dataset should be treated. Where a categorical variable has a NaN value, the default behaviour gives all of its dummy columns a value of 0. The keyword argument can be changed for cases where 'missingness' of data is not random, e.g. create a column for NaN values or give all the dummy columns a NaN value.

If the base level is None then a dummy column is created for every value of a categorical variable.
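
The base-level logic can be illustrated with plain pandas: with no base level every category value gets a dummy column, while a specified base level has its column dropped so that it acts as the reference category. (`DummyEncoder` wraps this kind of logic together with its NaN policies; the snippet below is only an illustration.)

```python
import pandas as pd

df_raw = pd.DataFrame({'schtyp': ['public', 'private', 'public']})

# Base level None: one dummy column per category value.
all_levels = pd.get_dummies(df_raw['schtyp'], prefix='schtyp')

# Base level 'public': drop that column so it becomes the reference.
vs_base = all_levels.drop(columns='schtyp_public')
```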

## `InteractionEncoder`
[![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/01-01_dummy-and-interaction-encoders_hsbdemo.ipynb): static render of a notebook that shows the many ways in which interaction effects can be encoded (e.g. between two Boolean variables, a continuous variable & categorical variable, etc.).

Suppose `math` and `socst` are two continuous variables in a dataframe `df_raw` and we want to make an interaction effect `math#socst`. A new dataframe `df_model` is returned – with the new column `math#socst` – by calling the `transform` method on an instance of `InteractionEncoder`:
```python
from appelpy.utils import InteractionEncoder
df_model = InteractionEncoder(df_raw, {'math': ['socst']}).transform()
```
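
For two continuous variables the encoded interaction column is simply their elementwise product. A plain-pandas illustration of the outcome (a sketch of the result, not `InteractionEncoder`'s code):

```python
import pandas as pd

df_raw = pd.DataFrame({'math': [60.0, 70.0], 'socst': [50.0, 80.0]})

# The interaction column is the elementwise product of the two columns.
df_model = df_raw.assign(**{'math#socst': df_raw['math'] * df_raw['socst']})
```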

## `get_dataframe_columns_diff` function
This function returns the columns that are present in one dataframe but not in another, so it is handy when comparing a raw dataframe and a transformed dataframe.

Suppose that there is a raw dataframe `df_raw` and a transformed dataframe `df_enc`. The recipe below will display the columns removed from the raw dataframe and the columns added to the transformed dataframe:
```python
from appelpy.utils import get_dataframe_columns_diff

print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
```
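
Assuming the function is essentially a set difference on column names, a plausible sketch (hypothetical implementation, for illustration only):

```python
import pandas as pd

def get_columns_diff(df_a, df_b):
    """Columns present in df_a but not in df_b, preserving order
    (hypothetical stand-in for appelpy's get_dataframe_columns_diff)."""
    return [col for col in df_a.columns if col not in df_b.columns]
```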
17 changes: 17 additions & 0 deletions mkdocs.yml
@@ -0,0 +1,17 @@
site_name: appelpy
repo_url: https://github.com/mfarragher/appelpy

theme:
name: readthedocs

nav:
- Home: index.md
- Reference:
- eda: reference/eda.md
- utils: reference/utils.md
- linear_model: reference/linear-model.md
- discrete_model: reference/discrete-model.md
- diagnostics: reference/diagnostics.md

markdown_extensions:
- admonition
