Commit: Initial documentation

mfarragher committed Nov 10, 2019
1 parent bcfd970 commit 2f900f3
Showing 8 changed files with 352 additions and 0 deletions.
Binary file added docs/img/2x2-diagnostics-ols.png
58 changes: 58 additions & 0 deletions docs/index.md
@@ -0,0 +1,58 @@
<header>
<p style="font-size:28px;"><b>appelpy: Applied Econometrics Library for Python</b></p>
</header>

**appelpy** is the *Applied Econometrics Library for Python*. It seeks to bridge the gap between the software options that have a simple syntax (such as Stata) and other powerful options that use Python's object-oriented programming as part of data modelling workflows. ⚗️

Econometric modelling and general regression analysis in Python have never been easier!

The library builds upon the functionality of the 'vanilla' Python data stack (e.g. pandas, NumPy) and other libraries such as Statsmodels.

## 10 Minutes to Appelpy
Explore the core functionality of Appelpy in the **10 Minutes To Appelpy** notebook (click the badges):

- [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/mfarragher/appelpy-examples/master?filepath=00_ten-minutes-to-appelpy.ipynb): interactive experience of the *10 Minutes to Appelpy* tutorial via Binder.
- [![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/00_ten-minutes-to-appelpy.ipynb): static render of the *10 Minutes to Appelpy* notebook.

# Installation
Install the library via the Pip command:
``` bash
pip install appelpy
```

Python 3.6 and higher versions are supported.

# Basic usage
It only takes one line of code to fit a basic linear model of 'y on X' and another line to return the model's results.

```python
from appelpy.linear_model import OLS

model1 = OLS(df, y_list, X_list).fit() # y_list & X_list contain df columns
model1.results_output # returns summary results
```

Model objects have many useful attributes, e.g. the inputs X & y, standardized values of X and y, and the results of fitted models (incl. standardized estimates). The library also has diagnostic classes and functions that consume model objects (or their underlying data).

Here are more things that can be obtained via one line of code:

* *Diagnostics* can be called from the object: e.g. produce a P-P plot via `model1.diagnostic_plot('pp_plot')`
* *Model selection statistics*: e.g. find the root mean square error of the model from `model1.model_selection_stats`
* *Standardized model estimates*: `model1.results_output_standardized`

Classes in the library have a fluent interface, so that they can be instantiated and have chained methods in one line of code.

# Modules
## Exploration and pre-processing
- **`eda`:** functions for exploratory data analysis (EDA) of datasets, e.g. `statistical_moments` for obtaining mean, variance, skewness and kurtosis of all numeric columns.
- **`utils`:** classes and functions for data pre-processing, e.g. encoding of interaction effects and dummy variables in datasets.
- `DummyEncoder`: encode dummy variables in a dataset based on different policies for dealing with NaN values.
- `InteractionEncoder`: encode interaction effects of variables in a dataset.
## Model fitting
- **`linear_model`:** classes for linear models such as Ordinary Least Squares (OLS) and Weighted Least Squares (WLS).
- **`discrete_model`:** classes for discrete choice models, e.g. logistic regression (Logit).
## Model diagnostics
- **`diagnostics`:**
- `BadApples`: class for inspecting observations that could 'stink up' a model, i.e. the observations that are outliers, high-leverage points or else have high influence in a model.
- `variance_inflation_factors`: function that returns variance inflation factor (VIF) scores for regressors in a dataset.
- `partial_regression_plot`: also known as 'added variable plot'. Examine the effect of adding a regressor to a model.
90 changes: 90 additions & 0 deletions docs/reference/diagnostics.md
@@ -0,0 +1,90 @@
<header>
<pre><p style="font-size:28px;"><b>diagnostics</b></p></pre>
</header>

# Overview
The **`diagnostics`** module has classes and functions to examine the fit of OLS models and the extreme observations in datasets.

The main class is the `BadApples` class, which consumes an OLS model object and is used to examine the outliers, high-leverage points and influential points in a model. In essence it is used to examine the 'bad apples' that may be stinking up a model's results.

The main functions are:

- `variance_inflation_factors`
- `heteroskedasticity_test`
- `partial_regression_plot`

There are also functions for diagnostic plots such as `pp_plot`, but these are exposed more conveniently through the `diagnostic_plot` method of an OLS model object.

## `BadApples`
The **10 Minutes To Appelpy** notebook fits a **BadApples** instance, consuming a model of the California Test Score dataset.

- [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/mfarragher/appelpy-examples/master?filepath=00_ten-minutes-to-appelpy.ipynb): interactive experience of the *10 Minutes to Appelpy* tutorial via Binder.
- [![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/00_ten-minutes-to-appelpy.ipynb): static render of the *10 Minutes to Appelpy* notebook.

```python
from appelpy.diagnostics import BadApples
bad_apples = BadApples(model_hc1).fit()
```

### Attributes
- Measures: `measures_influence`, `measures_leverage` and `measures_outliers`.
- Indices: `indices_high_influence`, `indices_high_leverage` and `indices_outliers`.

INFLUENCE:

- `dfbeta` (one column per independent variable): DFBETA diagnostic. Extreme if val > 2 / sqrt(n).
- `dffits`: DFFITS diagnostic. Extreme if val > 2 * sqrt(k / n).
- `cooks_d`: Cook's distance. Extreme if val > 4 / n.

LEVERAGE:

- `leverage`: value from the hat matrix diagonal. Extreme if val > (2*k + 2) / n.

OUTLIERS:

- `resid_standardized`: standardized residual. Extreme if |val| > 2, i.e. approx. 5% of observations will be flagged.
- `resid_studentized`: studentized residual. Extreme if |val| > 2, i.e. approx. 5% of observations will be flagged.
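
The cutoffs above depend only on the number of observations (n) and the number of regressors (k). Here is a minimal helper that computes them, as a sketch of the rule-of-thumb conventions listed (not appelpy's internal code):

```python
from math import sqrt

def influence_thresholds(n, k):
    """Rule-of-thumb cutoffs for flagging extreme observations,
    given n observations and k regressors."""
    return {
        'dfbeta': 2 / sqrt(n),        # per-regressor DFBETA cutoff
        'dffits': 2 * sqrt(k / n),    # DFFITS cutoff
        'cooks_d': 4 / n,             # Cook's distance cutoff
        'leverage': (2 * k + 2) / n,  # hat-value cutoff
        'resid': 2,                   # |standardized/studentized residual|
    }
```

For example, with n=420 and k=2 the Cook's distance cutoff is roughly 0.0095, so only a handful of observations would normally be flagged.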

### Methods
The `plot_leverage_vs_residuals_squared` method plots leverage values (y-axis) against the residuals squared (x-axis). The plot can be annotated with the index values.

## Variance inflation factors
The `variance_inflation_factors` function takes a dataframe and calculates the variance inflation factor (VIF) for each of its regressors.
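
A VIF is computed with the standard formula 1 / (1 - R^2), where R^2 comes from regressing one column on all of the others. Here is a self-contained sketch of that computation (illustrative only, not appelpy's own implementation):

```python
import numpy as np

def vif_scores(X):
    """Variance inflation factors from first principles: regress each
    column on the remaining columns (plus a constant) and take
    1 / (1 - R^2). Illustrative sketch, not appelpy's implementation."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r_squared = 1 - resid.var() / y.var()
        vifs.append(1 / (1 - r_squared))
    return vifs
```

A score far above 1 signals that a regressor is close to a linear combination of the others (a common rule of thumb flags VIF > 10).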

## Heteroskedasticity test
The `heteroskedasticity_test` function takes an OLS model object and returns the results of a heteroskedasticity test (the test statistic and p-value). Examples of heteroskedasticity tests include:

- Breusch-Pagan test (`breusch_pagan`)
- Breusch-Pagan studentized test (`breusch_pagan_studentized`)
- White test (`white`)

The **10 Minutes To Appelpy** notebook shows the results of heteroskedasticity tests, given a model fitted to the California Test Score dataset.

- [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/mfarragher/appelpy-examples/master?filepath=00_ten-minutes-to-appelpy.ipynb): interactive experience of the *10 Minutes to Appelpy* tutorial via Binder.
- [![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/00_ten-minutes-to-appelpy.ipynb): static render of the *10 Minutes to Appelpy* notebook.

Here is a code snippet for a heteroskedasticity test.
```python
from appelpy.diagnostics import heteroskedasticity_test

ep, pval = heteroskedasticity_test('breusch_pagan_studentized', model_nonrobust)
print('Breusch-Pagan test (studentized)')
print('Test statistic: {:.4f}'.format(ep))
print('Test p-value: {:.4f}'.format(pval))
```

## Partial regression plot
Also known as the added variable plot, the partial regression plot shows the effect of adding another regressor (independent variable) to a regression model.

The method requires these parameters:

- `appelpy_model_object`: a fitted OLS model object.
- `df`: the dataframe used in the model.
- `regressor`: the additional variable in the partial regression.
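
The plot's points are residuals against residuals: the vertical axis holds the residuals of y regressed on the existing regressors, and the horizontal axis the residuals of the candidate regressor on that same set. By the Frisch-Waugh-Lovell theorem, the slope through these points equals the coefficient the candidate would get in the full model. A sketch of the computation (illustrative, not appelpy's implementation):

```python
import numpy as np

def partial_regression_points(y, X, z):
    """Points of an added-variable plot: residuals of y on X (vertical)
    against residuals of candidate regressor z on X (horizontal)."""
    n = len(y)
    Xc = np.column_stack([np.ones(n), X])  # add a constant term

    def resid(v):
        beta, *_ = np.linalg.lstsq(Xc, np.asarray(v, float), rcond=None)
        return np.asarray(v, float) - Xc @ beta

    return resid(z), resid(y)
```

Scattering the two returned series (e.g. with Matplotlib) reproduces the added variable plot for `z`.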
46 changes: 46 additions & 0 deletions docs/reference/discrete-model.md
@@ -0,0 +1,46 @@
<header>
<pre><p style="font-size:28px;"><b>discrete_model</b></p></pre>
</header>

# Overview
These are the classes for discrete choice models:

- Logistic regression (`Logit`)

The classes are built upon Statsmodels.

# Fit a model
[![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/02-01_logistic-regression_glm-logit.ipynb): static render of the notebook that fits a Logit regression. Predictions are also made using the original data to show the estimated probabilities of the positive class.

```python
from appelpy.discrete_model import Logit
model1 = Logit(df, y_list, X_list).fit()
model1.results_output # returns summary results
```

The **`fit` method must be called** in order to set attributes for the model object.

There are three **important parameters** for initialising any model class in Appelpy:

- `df`: the dataframe to use for modelling. This must have no NaN values, no infinite values, etc.
- `y_list`: a list with the dependent variable column name.
- `regressors_list`: a list of column names for independent variables (regressors).

## Attributes
Here are some attributes available for discrete choice models:

- `y` and `X`: the dataframes of the dependent and independent variables.
- `y_standardized` and `X_standardized`: the standardized versions of `y` and `X`.
- `results_output` for the Statsmodels summary of the model. Note: the Statsmodels results object is also stored in the `results` attribute.
- `results_output_standardized` for the standardized estimates of the model.
- `model_selection_stats`: dictionary of key statistics on the model fit.
- The model residuals `resid` and their standardized form `resid_standardized`.

Logit has the `odds_ratio` attribute.
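
An odds ratio is the exponential of a logit coefficient: a one-unit increase in the regressor multiplies the odds of the positive class by exp(beta). A minimal illustration of the transformation (`odds_ratio` presumably applies it to every fitted coefficient):

```python
from math import exp

# A coefficient of ~0.6931 (i.e. ln 2) means each one-unit increase
# in the regressor doubles the odds of the positive class.
beta = 0.693147
odds_ratio = exp(beta)  # ~2.0
```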

## Methods
For all model classes there is a `significant_regressors` method that returns a list of the significant independent variables of a model, given a significance level *alpha*.

Use `fit` to fit a model.

Pass a NumPy array to a `predict` call in order to make predictions given a model. The method checks whether the regressor values passed to it are 'within sample' before returning predictions. By default, predictions are only returned if all the regressor values are 'within sample'.
20 changes: 20 additions & 0 deletions docs/reference/eda.md
@@ -0,0 +1,20 @@
<header>
<pre><p style="font-size:28px;"><b>eda</b></p></pre>
</header>

# Overview
The **`eda`** module has functions to support exploratory data analysis.

## Statistical moments
The **10 Minutes To Appelpy** notebook shows the statistical moments of the California Test Score dataset.

- [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/mfarragher/appelpy-examples/master?filepath=00_ten-minutes-to-appelpy.ipynb): interactive experience of the *10 Minutes to Appelpy* tutorial via Binder.
- [![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/00_ten-minutes-to-appelpy.ipynb): static render of the *10 Minutes to Appelpy* notebook.

```python
from appelpy.eda import statistical_moments
statistical_moments(df)
```

## Correlation heatmap
The `correlation_heatmap` function produces a heatmap (triangular form) of the correlation matrix, given a dataset `df`.
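
The triangular form shows each pairwise correlation exactly once by masking the redundant upper triangle. A sketch of the matrix that the heatmap draws (illustrative only; `correlation_heatmap` renders the actual plot):

```python
import numpy as np

def lower_triangle_corr(data):
    """Correlation matrix with the upper triangle masked out (NaN),
    i.e. the triangular form a heatmap would display."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
    return np.where(mask, np.nan, corr)
```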
72 changes: 72 additions & 0 deletions docs/reference/linear-model.md
@@ -0,0 +1,72 @@
<header>
<pre><p style="font-size:28px;"><b>linear_model</b></p></pre>
</header>

# Overview
There are two classes for linear models:

- Weighted Least Squares (WLS)
- Ordinary Least Squares (OLS)

**OLS** is the core model class in Appelpy. Many of the model diagnostics available are built for OLS models.

The classes are built upon Statsmodels. Features that are available in Appelpy include standardized estimates for models.

# Fit a model
The **10 Minutes To Appelpy** notebook fits an **OLS** model to the California Test Score dataset.

- [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/mfarragher/appelpy-examples/master?filepath=00_ten-minutes-to-appelpy.ipynb): interactive experience of the *10 Minutes to Appelpy* tutorial via Binder.
- [![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/00_ten-minutes-to-appelpy.ipynb): static render of the *10 Minutes to Appelpy* notebook.

```python
from appelpy.linear_model import OLS
model1 = OLS(df, y_list, X_list).fit()
model1.results_output # returns summary results
```

The **`fit` method must be called** in order to set attributes for the model object.

There are three **important parameters** for initialising any model class in Appelpy:

- `df`: the dataframe to use for modelling. This must have no NaN values, no infinite values, etc.
- `y_list`: a list with the dependent variable column name.
- `regressors_list`: a list of column names for independent variables (regressors).

Other keyword arguments can be used when initialising the model class, e.g. `cov_type` to specify the type of standard errors in the model.

## Attributes
Here are some attributes available for linear models:

- `y` and `X`: the dataframes of the dependent and independent variables.
- `y_standardized` and `X_standardized`: the standardized versions of `y` and `X`.
- `results_output` for the Statsmodels summary of the model. Note: the Statsmodels results object is also stored in the `results` attribute.
- `results_output_standardized` for the standardized estimates of the model.
- `model_selection_stats`: dictionary of key statistics on the model fit, including the root mean square error (root MSE).
- The model residuals `resid` and their standardized form `resid_standardized`.
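
A standardized estimate rescales a raw coefficient into standard deviations of y per standard deviation of x, which makes effect sizes comparable across regressors. A minimal sketch of the usual conversion (what `results_output_standardized` reports per regressor, though not appelpy's exact code):

```python
import numpy as np

def standardized_coef(beta, x, y):
    """Raw OLS coefficient -> standardized coefficient:
    the change in y (in std devs) per one-std-dev change in x."""
    return beta * np.std(x, ddof=1) / np.std(y, ddof=1)
```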

## Methods
For all linear model classes there is a `significant_regressors` method that returns a list of the significant independent variables of a model, given a significance level *alpha*.

Use `fit` to fit a model.

Pass a NumPy array to a `predict` call in order to make predictions given a model. The method checks whether the regressor values passed to it are 'within sample' before returning predictions. By default, predictions are only returned if all the regressor values are 'within sample'.
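
A minimal sketch of that kind of 'within sample' guard, assuming it checks each new value against the min-max range of the corresponding training column:

```python
import numpy as np

def within_sample(X_train, x_new):
    """True if every value in x_new lies inside the [min, max] range
    of the corresponding training column (hypothetical helper; the
    real check lives inside the model's predict method)."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    return bool(np.all((x_new >= X_train.min(axis=0)) &
                       (x_new <= X_train.max(axis=0))))
```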

### Diagnostic plot method
There is also a convenient `diagnostic_plot` method on OLS model objects, which returns plots such as:

- P-P plot (`pp_plot`)
- Q-Q plot (`qq_plot`)
- Residuals vs fitted values plot (`rvf_plot`)
- Residuals vs predicted values plot (`rvp_plot`)

Here is a useful recipe for producing a 2x2 grid of diagnostic plots with Matplotlib:
```python
import matplotlib.pyplot as plt

fig, axarr = plt.subplots(2, 2, figsize=(10, 10))
model_nonrobust.diagnostic_plot('pp_plot', ax=axarr[0][0])
model_nonrobust.diagnostic_plot('qq_plot', ax=axarr[0][1])
model_nonrobust.diagnostic_plot('rvp_plot', ax=axarr[1][0])
model_nonrobust.diagnostic_plot('rvf_plot', ax=axarr[1][1])
plt.tight_layout()
```

![OLS diagnostics plot](/img/2x2-diagnostics-ols.png)
49 changes: 49 additions & 0 deletions docs/reference/utils.md
@@ -0,0 +1,49 @@
<header>
<pre><p style="font-size:28px;"><b>utils</b></p></pre>
</header>

# Overview
The **`utils`** module has classes and functions to support pre-processing of datasets before modelling.

The main classes are:

- `DummyEncoder`
- `InteractionEncoder`

## `DummyEncoder`
[![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/01-01_dummy-and-interaction-encoders_hsbdemo.ipynb): static render of a notebook that has an example of dummy variable encoding.

Suppose `schtyp`, `prog` and `honors` are categorical variables in a dataframe `df_raw` and we want to make dummy variables for them in a new transformed dataset. A new dataframe `df` is returned by calling the `transform` method on an instance of `DummyEncoder`:
```python
from appelpy.utils import DummyEncoder
df = (DummyEncoder(df_raw, {'schtyp': None,
'prog': None,
'honors': None})
.transform())
```

The class is initialised with:

- A raw dataframe
- A dictionary of key-value pairs, where each pair has a categorical variable as a key and a base level as a value.
- A 'NaN policy' to determine how any NaN values in the dataset should be treated. Where a categorical variable has a NaN value, the default behaviour gives all of its dummy columns a value of 0. The keyword argument can be changed for cases where 'missingness' of data is not random, e.g. create a column for NaN values or give all the dummy columns a NaN value.

If the base level is None then a dummy column is created for every value of a categorical variable.
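
The base-level logic can be illustrated with plain pandas: with no base level every category value gets a dummy column, while a specified base level has its column dropped so that it acts as the reference category. (`DummyEncoder` wraps this kind of logic together with its NaN policies; the snippet below is only an illustration.)

```python
import pandas as pd

df_raw = pd.DataFrame({'schtyp': ['public', 'private', 'public']})

# Base level None: one dummy column per category value.
all_levels = pd.get_dummies(df_raw['schtyp'], prefix='schtyp')

# Base level 'public': drop that column so it becomes the reference.
vs_base = all_levels.drop(columns='schtyp_public')
```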

## `InteractionEncoder`
[![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.jupyter.org/github/mfarragher/appelpy-examples/blob/master/01-01_dummy-and-interaction-encoders_hsbdemo.ipynb): static render of a notebook that shows the many ways in which interaction effects can be encoded (e.g. between two Boolean variables, a continuous variable & categorical variable, etc.).

Suppose `math` and `socst` are two continuous variables in a dataframe `df_raw` and we want to make an interaction effect `math#socst`. A new dataframe `df_model` is returned – with the new column `math#socst` – by calling the `transform` method on an instance of `InteractionEncoder`:
```python
from appelpy.utils import InteractionEncoder
df_model = InteractionEncoder(df_raw, {'math': ['socst']}).transform()
```
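
For two continuous variables the encoded interaction column is simply their elementwise product. A plain-pandas illustration of the outcome (a sketch of the result, not `InteractionEncoder`'s code):

```python
import pandas as pd

df_raw = pd.DataFrame({'math': [60.0, 70.0], 'socst': [50.0, 80.0]})

# The interaction column is the elementwise product of the two columns.
df_model = df_raw.assign(**{'math#socst': df_raw['math'] * df_raw['socst']})
```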

## `get_dataframe_columns_diff` function
This function returns the columns that are present in one dataframe but not in another, so it is handy when comparing a raw dataframe and a transformed dataframe.

Suppose that there is a raw dataframe `df_raw` and a transformed dataframe `df_enc`. The recipe below will display the columns removed from the raw dataframe and the columns added to the transformed dataframe:
```python
from appelpy.utils import get_dataframe_columns_diff

print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")
```
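
Assuming the function is essentially a set difference on column names, a plausible sketch (hypothetical implementation, for illustration only):

```python
import pandas as pd

def get_columns_diff(df_a, df_b):
    """Columns present in df_a but not in df_b, preserving order
    (hypothetical stand-in for appelpy's get_dataframe_columns_diff)."""
    return [col for col in df_a.columns if col not in df_b.columns]
```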
17 changes: 17 additions & 0 deletions mkdocs.yml
@@ -0,0 +1,17 @@
site_name: appelpy
repo_url: https://github.com/mfarragher/appelpy

theme:
name: readthedocs

nav:
- Home: index.md
- Reference:
- eda: reference/eda.md
- utils: reference/utils.md
- linear_model: reference/linear-model.md
- discrete_model: reference/discrete-model.md
- diagnostics: reference/diagnostics.md

markdown_extensions:
- admonition
