A native Python implementation of elastic-net regularized generalized linear models
Generalized linear models are well-established tools for regression and classification and are widely applied across the sciences, economics, business, and finance. They are uniquely identifiable due to their convex loss and easy to interpret due to their point-wise non-linearities and well-defined noise models.
In the era of exploratory data analyses with a large number of predictor variables, it is important to regularize. Regularization prevents overfitting by penalizing the negative log likelihood and can be used to articulate prior knowledge about the parameters in a structured form.
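To make "penalizing the negative log likelihood" concrete, here is a minimal sketch of an elastic-net penalized loss for a Poisson GLM. It assumes the canonical log link and drops the constant log(y!) term. The function names and the `reg_lambda` / `alpha` parameter names are chosen for illustration and are not taken from the Pyglmnet source.

```python
import numpy as np


def elastic_net_penalty(beta, alpha):
    """Elastic-net penalty: alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm."""
    l1 = np.sum(np.abs(beta))
    l2 = 0.5 * np.sum(beta ** 2)
    return alpha * l1 + (1.0 - alpha) * l2


def penalized_poisson_loss(beta0, beta, X, y, reg_lambda, alpha):
    """Mean Poisson negative log-likelihood (log link assumed) plus the penalty."""
    eta = beta0 + X @ beta       # linear predictor
    mu = np.exp(eta)             # conditional mean under the assumed log link
    nll = np.mean(mu - y * eta)  # log(y!) omitted: it does not depend on beta
    return nll + reg_lambda * elastic_net_penalty(beta, alpha)
```

With `alpha=1.0` the penalty reduces to the lasso and with `alpha=0.0` to ridge; the intercept `beta0` is left unpenalized in this sketch.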
Despite the attractiveness of regularized GLMs, the available tools in the Python data science ecosystem are highly fragmented. More specifically,
- statsmodels provides a wide range of link functions but no regularization.
- scikit-learn provides elastic net regularization but only for linear models.
- lightning provides elastic net and group lasso regularization, but only for linear and logistic regression.
Pyglmnet is a response to this fragmentation. Here are some highlights.
Pyglmnet provides a wide range of noise models (and paired canonical link functions):
Pyglmnet's API is designed to be compatible with scikit-learn, so you can deploy Pipeline tools such as
We have implemented a cyclical coordinate descent optimizer with Newton updates, active sets, update caching, and warm restarts. This optimization approach is the same as the one used in the R glmnet package.
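The coordinate-wise idea can be sketched for the simplest case, a Gaussian model with an elastic-net penalty, where each one-dimensional update has a closed form (soft-thresholding). This is an illustrative simplification, not Pyglmnet's optimizer: it caches the residual and warm-starts along a decreasing sequence of penalties, but omits active sets and the Newton step needed for non-Gaussian noise models. All function and parameter names are assumptions made for this example.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def coordinate_descent(X, y, reg_lambda, alpha, beta_init=None,
                       n_iter=200, tol=1e-8):
    """Minimize (1/2n)||y - X b||^2 + reg_lambda * elastic-net penalty.

    Assumes X has no all-zero columns (otherwise the update can divide by zero).
    """
    n_samples, n_features = X.shape
    beta = np.zeros(n_features) if beta_init is None else beta_init.copy()
    residual = y - X @ beta                    # cached and updated incrementally
    col_sq = (X ** 2).sum(axis=0) / n_samples  # per-coordinate curvature
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(n_features):            # cycle through coordinates
            old = beta[j]
            rho = X[:, j] @ residual / n_samples + col_sq[j] * old
            new = (soft_threshold(rho, reg_lambda * alpha)
                   / (col_sq[j] + reg_lambda * (1.0 - alpha)))
            if new != old:
                residual -= X[:, j] * (new - old)
                beta[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:                   # stop once a full sweep barely moves
            break
    return beta


def fit_path(X, y, reg_lambdas, alpha):
    """Fit from the largest penalty to the smallest, warm-starting each fit."""
    beta, path = None, []
    for lam in sorted(reg_lambdas, reverse=True):
        beta = coordinate_descent(X, y, lam, alpha, beta_init=beta)
        path.append((lam, beta))
    return path
```

Warm starts help because the solution at one penalty value is usually close to the solution at the next, so each fit along the path needs only a few sweeps.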
A number of Python wrappers exist for the R glmnet package (e.g. here and here) but in contrast to these, Pyglmnet is a pure Python implementation. Therefore, it is easy to modify and to extend with additional noise models and regularizers in the future.
Here is a table comparing
The numbers below are run times (in milliseconds) to fit a sparse matrix of $1000$ samples x $100$ predictors (density $0.05$). This was done on a c. 2011 MacBook Pro, so your numbers may vary.
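As a rough sketch of how such a timing could be set up, the snippet below builds a $1000$ x $100$ design matrix with about 5% nonzero entries and reports the best wall-clock time of a fit in milliseconds. It is not the benchmark function shipped with Pyglmnet: the helper names are hypothetical, the matrix is a dense NumPy array that is mostly zeros rather than a true sparse matrix, and a closed-form ridge solve stands in for whichever solver you want to time.

```python
import time

import numpy as np


def make_sparse_design(n_samples=1000, n_features=100, density=0.05, seed=0):
    """Dense array in which roughly `density` of the entries are nonzero."""
    rng = np.random.default_rng(seed)
    values = rng.normal(size=(n_samples, n_features))
    mask = rng.random((n_samples, n_features)) < density
    return values * mask


def time_fit_ms(fit, X, y, n_repeats=5):
    """Best wall-clock time of fit(X, y) over n_repeats runs, in milliseconds."""
    best = float("inf")
    for _ in range(n_repeats):
        start = time.perf_counter()
        fit(X, y)
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


def ridge_fit(X, y, reg_lambda=0.1):
    """Stand-in solver (closed-form ridge); swap in the estimator under test."""
    n_features = X.shape[1]
    gram = X.T @ X + reg_lambda * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)


X = make_sparse_design()
y = X @ np.ones(X.shape[1])
print("ridge stand-in: %.2f ms" % time_fit_ms(ridge_fit, X, y))
```

Taking the minimum over several repeats reduces the influence of background load on the measurement.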
We provide a function called
if you would like to run these benchmarks yourself, but you need to take
care of the dependencies:
$ pip install pyglmnet
Manual installation instructions below:
Clone the repository.
$ git clone http://github.com/glm-tools/pyglmnet
Run setup.py as follows:
$ python setup.py develop install
Here is an example of how to use the
    import numpy as np
    import scipy.sparse as sps
    from sklearn.preprocessing import StandardScaler
    from pyglmnet import GLM

    # create an instance of the GLM class
    glm = GLM(distr='poisson')

    n_samples, n_features = 10000, 100

    # sample random coefficients (about 10% of them nonzero)
    beta0 = np.random.normal(0.0, 1.0, 1)
    beta = sps.rand(n_features, 1, 0.1)
    beta = np.array(beta.todense())

    # simulate training data
    X_train = np.random.normal(0.0, 1.0, [n_samples, n_features])
    y_train = glm.simulate(beta0, beta, X_train)

    # simulate testing data
    X_test = np.random.normal(0.0, 1.0, [n_samples, n_features])
    y_test = glm.simulate(beta0, beta, X_test)

    # fit the model on the standardized training data
    scaler = StandardScaler().fit(X_train)
    glm.fit(scaler.transform(X_train), y_train)

    # predict using the fitted model on the standardized test data
    yhat_test = glm.predict(scaler.transform(X_test))

    # score the model, applying the same scaling used for fitting
    deviance = glm.score(scaler.transform(X_test), y_test)
Here is an extensive tutorial on GLMs, optimization and pseudo-code.
How to contribute?
We welcome pull requests. Please see our developer documentation page for more details.
MIT License Copyright (c) 2016 Pavan Ramkumar