Pandas-aware non-linear least squares regression using Lmfit

pdLSR: Pandas-aware least squares regression

Overview

pdLSR is a library for performing least squares regression. It attempts to seamlessly incorporate this task in a Pandas-focused workflow. Input data are expected in dataframes, and multiple regressions can be performed using functionality similar to Pandas groupby. Results are returned as grouped dataframes and include best-fit parameters, statistics, residuals, and more. The results can be easily visualized using seaborn.

pdLSR currently uses lmfit, a flexible and powerful library for least squares minimization, which in turn makes use of scipy.optimize.leastsq. I began using lmfit because it is one of the few libraries that support non-linear least squares regression, which is commonly used in the natural sciences. I also like the flexibility it offers for testing different modeling scenarios and the variety of assessment statistics it provides. However, I found myself writing many for loops to perform regressions on groups of data and aggregate the resulting output. Simplifying this task was my inspiration for writing pdLSR.
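The manual pattern described above can be sketched as follows. This is an illustration of the kind of loop pdLSR is meant to replace, not pdLSR's own API. It calls scipy.optimize.leastsq directly rather than going through lmfit, and the model (exponential_decay), the column names (sample, x, y), and the synthetic data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy.optimize import leastsq


def exponential_decay(params, x):
    """Hypothetical model for illustration: y = amplitude * exp(-rate * x)."""
    amplitude, rate = params
    return amplitude * np.exp(-rate * x)


def residuals(params, x, y):
    """Observed minus modeled values; leastsq minimizes their sum of squares."""
    return y - exponential_decay(params, x)


# Synthetic, noise-free data for two groups (assumed layout: one row per point).
x = np.linspace(0.0, 5.0, 20)
data = pd.concat(
    [
        pd.DataFrame({"sample": name, "x": x, "y": exponential_decay((amp, rate), x)})
        for name, amp, rate in [("a", 2.0, 0.5), ("b", 5.0, 1.5)]
    ],
    ignore_index=True,
)

# The manual loop: fit each group separately, then collect the results.
rows = []
for name, group in data.groupby("sample"):
    best, _ = leastsq(
        residuals,
        x0=[1.0, 1.0],  # same starting guess for every group
        args=(group["x"].to_numpy(), group["y"].to_numpy()),
    )
    rows.append({"sample": name, "amplitude": best[0], "rate": best[1]})

# Aggregate the per-group parameters into a single dataframe.
fit_params = pd.DataFrame(rows).set_index("sample")
print(fit_params)
```

Even in this small case, the grouping, fitting, and aggregation steps are written by hand, and statistics or residuals would each need further bookkeeping of the same kind.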

pdLSR is related to libraries such as statsmodels and scikit-learn, which provide linear regression functions that operate on dataframes. However, these libraries don't support grouping operations on dataframes and don't aggregate output into dataframes. Support for statsmodels and scikit-learn is being considered for the future, and pull requests adding this functionality would be welcome.

Some additional conveniences for entering parameters and equations have also been incorporated. pdLSR also runs the calculation of confidence intervals in parallel, as this process is time-consuming when there are more than a few groups.
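The idea of spreading per-group interval calculations across workers can be sketched as below. This is a sketch under several assumptions and does not show pdLSR's internals: it uses the standard library's multiprocessing.Pool rather than multiprocess, it uses scipy.optimize.curve_fit rather than lmfit, and it approximates 95% intervals from the covariance matrix (estimate ± 1.96 standard errors), which is a simpler method than the confidence interval search lmfit provides. The function names and the task layout (key, x, y) are hypothetical.

```python
from multiprocessing import Pool

import numpy as np
from scipy.optimize import curve_fit


def exponential_decay(x, amplitude, rate):
    """Hypothetical model for illustration: y = amplitude * exp(-rate * x)."""
    return amplitude * np.exp(-rate * x)


def interval_for_group(task):
    """Fit one group and return approximate 95% intervals for its parameters.

    `task` is a (key, x, y) tuple. The intervals are estimate +/- 1.96 standard
    errors taken from the covariance matrix, an approximation chosen for brevity.
    """
    key, x, y = task
    best, cov = curve_fit(exponential_decay, x, y, p0=[1.0, 1.0])
    half_width = 1.96 * np.sqrt(np.diag(cov))
    return key, best - half_width, best + half_width


def intervals_in_parallel(tasks, processes=2):
    """Map interval_for_group over all groups using a pool of worker processes."""
    with Pool(processes=processes) as pool:
        return pool.map(interval_for_group, tasks)
```

When run as a script, intervals_in_parallel(tasks) should be called from inside an `if __name__ == "__main__":` block so that worker processes can import the module safely. Each group is independent of the others, which is what makes this step straightforward to parallelize.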

Setup

Dependencies

The following libraries are required for pdLSR:

  • numpy
  • pandas
  • lmfit
  • multiprocess

multiprocess is a fork of Python's multiprocessing library that uses dill for more robust serialization. I found that this library is required for parallel execution to work with pdLSR. Both multiprocess and lmfit are installed automatically by pip or conda (see below).

For plotting, matplotlib is required and seaborn is recommended.

pdLSR works with Python 2 and 3.
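As a quick check of the dependencies listed above, the following sketch reports which of them cannot be found in the current environment. It assumes Python 3 (importlib.util.find_spec is not available in Python 2), and the helper name missing_modules is hypothetical.

```python
from importlib.util import find_spec

# Module names taken from the dependency list above.
REQUIRED = ["numpy", "pandas", "lmfit", "multiprocess"]
PLOTTING = ["matplotlib", "seaborn"]


def missing_modules(names):
    """Return the entries of `names` that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


print("Missing required libraries:", missing_modules(REQUIRED))
print("Missing plotting libraries:", missing_modules(PLOTTING))
```

An empty list for the required libraries means pdLSR's dependencies are importable; it does not check their versions.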

Installation and Demo

Binder

The preferred method for installing pdLSR and all of its dependencies is to use the conda or pip package managers.

  • For conda: conda install -c mlgill pdlsr (the name is lowercase here because conda seems to require lowercase package names)
  • For pip: pip install pdLSR

However, it can also be installed manually by cloning the repository into a directory on your PYTHONPATH.

There is a demo notebook that can be executed locally or live from GitHub using mybinder.org. After clicking the badge at the top of this section, navigate to pdLSR --> demo --> pdLSR_demo.ipynb and everything should be set up to execute the demo in a browser. No installation required!

Documentation

The functions of pdLSR are documented within the code, but currently the best single source of information on using pdLSR is the demo notebook. Developing stand-alone documentation is a future goal.