Initial code

manuel-calzolari committed Nov 17, 2020
1 parent 7c59bb2 commit 070f35a
Showing 9 changed files with 944 additions and 0 deletions.
210 changes: 210 additions & 0 deletions README.rst
@@ -0,0 +1,210 @@
=========
shapicant
=========

**shapicant** is a feature selection package based on `SHAP <https://github.com/slundberg/shap>`_ [LUN]_ and target permutation, for pandas and Spark.

It is inspired by PIMP [ALT]_, with some differences:

- PIMP fits a probability distribution to the population of null importances or, alternatively, uses a non-parametric estimation of the PIMP p-values. Instead, shapicant only implements the non-parametric estimation.
- For the non-parametric estimation, PIMP computes the fraction of null importances that are more extreme than the true importance (i.e. :code:`r/n`). Instead, shapicant computes it as :code:`(r+1)/(n+1)` [NOR]_.
- PIMP uses the Gini importance of Random Forest models or the Mutual Information criterion. Instead, shapicant uses SHAP values.
- While feature importance measures such as the Gini importance report only an absolute importance, SHAP provides both positive and negative impacts. Instead of taking the mean absolute value of the SHAP values of each feature as its importance, shapicant takes the mean of the positive and negative SHAP values separately. The true importance needs to be consistently higher than the null importances for both positive and negative impacts; for multi-class classification, it needs to be higher for at least one of the classes (see the sketch after this list).
- While feature importance measures such as the Gini importance of Random Forest models are computed on the training set, SHAP values can be computed out-of-sample. Therefore, shapicant allows computing them on a separate validation set. To decide whether to compute them on the training set or on a validation set, you can refer to this discussion of "`Training vs. Test Data <https://compstat-lmu.github.io/iml_methods_limitations/pfi-data.html>`_" (it discusses PFI [BRE]_, which is a different algorithm, but the general idea still applies).
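
A minimal, self-contained sketch of the estimation described above, covering the :code:`(r+1)/(n+1)` computation [NOR]_ and the separate handling of positive and negative SHAP values. The arrays and variable names below are illustrative only and not part of the shapicant API.

.. code:: python

    import numpy as np

    # Toy values for a single feature: true mean positive/negative SHAP values,
    # plus n = 100 null values obtained by refitting on permuted labels
    rng = np.random.default_rng(42)
    true_pos, true_neg = 0.25, -0.20
    null_pos = rng.normal(0.05, 0.05, size=100)
    null_neg = rng.normal(-0.05, 0.05, size=100)

    # r counts the null iterations that are at least as extreme as the true values
    # (on either the positive or the negative side)
    r = np.sum((null_pos >= true_pos) | (null_neg <= true_neg))
    n = len(null_pos)

    # Empirical p-value with the (r+1)/(n+1) estimator [NOR]
    p_value = (r + 1) / (n + 1)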

Permuting the response vector instead of permuting features has some advantages:

- The dependence between predictor variables remains unchanged.
- The number of permutations can be much smaller than the number of predictor variables for high-dimensional datasets (unlike PFI [BRE]_), and there is no need to add shadow features (unlike Boruta [KUR]_).
- Since the feature set does not change, the Spark implementation does not need to rebuild the features vector at each iteration.

------------
Installation
------------

^^^^^^^^^^^^
Dependencies
^^^^^^^^^^^^

shapicant requires:

- Python (>= 3.6)
- shap (>= 0.36.0)
- numpy
- pandas
- scikit-learn
- tqdm

For Spark, we also need:

- pyspark (>= 2.4)
- pyarrow

^^^^^^^^^^^^^^^^^
User installation
^^^^^^^^^^^^^^^^^

The easiest way to install shapicant is using :code:`pip`:

.. code:: bash

    pip install shapicant
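
To also install the optional Spark dependencies declared in the :code:`spark` extra of :code:`setup.py` (assuming the published package exposes this extra), :code:`pip` can be pointed at it explicitly:

.. code:: bash

    pip install "shapicant[spark]"
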
--------
Examples
--------

^^^^^^^^^^^^^^
PandasSelector
^^^^^^^^^^^^^^

If our data fit into the memory of a single machine, :code:`PandasSelector` is the way to go. This selector works on pandas DataFrames and supports estimators with a scikit-learn-like API.

First, we import the required packages and generate some data to work with.

.. code:: python

    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Generate a random classification problem
    X, y = make_classification(
        n_samples=1000,
        n_features=25,
        n_informative=3,
        n_redundant=2,
        n_repeated=2,
        n_classes=3,
        n_clusters_per_class=1,
        shuffle=False,
        random_state=42,
    )

    # PandasSelector works with pandas DataFrames, so convert X to a DataFrame
    X = pd.DataFrame(X)

    # Split training and validation sets
    # Note: in a real world setting, you probably want a test set as well
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, stratify=y, random_state=42)

We will use :code:`PandasSelector` with a LightGBM classifier in Random Forest mode and SHAP's TreeExplainer.

.. code:: python

    from shapicant import PandasSelector
    import lightgbm as lgb
    import shap

    # LightGBM in RandomForest-like mode (with row subsampling), without column subsampling
    model = lgb.LGBMClassifier(
        boosting_type="rf",
        subsample_freq=1,
        subsample=0.632,
        n_estimators=100,
        n_jobs=-1,
        random_state=42,
    )

    # This is the class (not its instance) of SHAP's TreeExplainer
    explainer_type = shap.TreeExplainer

    # Use PandasSelector with 100 iterations
    selector = PandasSelector(model, explainer_type, n_iter=100, random_state=42)

    # Run the feature selection
    # If we provide a validation set, SHAP values are computed on it, otherwise they are computed on the training set
    # We can also provide additional parameters to the underlying estimator's fit method through estimator_params
    selector.fit(X_train, y_train, X_validation=X_val, estimator_params={"categorical_feature": None})

    # Get the DataFrame with the selected features (with a p-value <= 0.05)
    X_train_selected = selector.transform(X_train, alpha=0.05)
    X_val_selected = selector.transform(X_val, alpha=0.05)

    # We can also get the p-values as a pandas Series
    p_values = selector.p_values_
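
Assuming :code:`p_values_` is a pandas Series indexed by the feature names (this snippet is illustrative and not part of the package API), we can quickly inspect which features fall below the threshold:

.. code:: python

    # Illustrative only: rank features by empirical p-value and list the selected ones
    print(p_values.sort_values().head(10))
    selected_features = p_values[p_values <= 0.05].index.tolist()
    print(selected_features)
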
^^^^^^^^^^^^^^
SparkSelector
^^^^^^^^^^^^^^

If our data does not fit into the memory of a single machine, :code:`SparkSelector` can be an alternative. This selector works on Spark DataFrames and supports PySpark estimators.

Please keep in mind the following caveats:

- Spark adds a lot of overhead, so if our data fit into the memory of a single machine, :code:`PandasSelector` will be much faster.
- SHAP does not support categorical features with Spark estimators (see https://github.com/slundberg/shap/pull/721).
- Data provided to :code:`SparkSelector` is assumed to have already been preprocessed, and each feature must correspond to a separate column. For example, if we want to one-hot encode a categorical feature, we must do so before providing the dataset to :code:`SparkSelector`, and each binary variable must have its own column (Vector-type columns are not supported); a minimal sketch follows this list.
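
As a sketch of that last caveat (assuming a Spark DataFrame :code:`sdf` with a hypothetical categorical column :code:`color`), each category can be expanded into its own 0/1 column so that no Vector-type column is ever created:

.. code:: python

    from pyspark.sql import functions as F

    # Illustrative only: one-hot encode a hypothetical "color" column into separate 0/1 columns
    categories = ["red", "green", "blue"]
    for category in categories:
        sdf = sdf.withColumn(
            f"color_{category}",
            F.when(F.col("color") == category, 1).otherwise(0),
        )
    sdf = sdf.drop("color")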

Let's generate some data to work with.

.. code:: python

    import pandas as pd
    from sklearn.datasets import make_classification
    from pyspark.sql import SparkSession

    # Generate a random classification problem
    X, y = make_classification(
        n_samples=10000,
        n_features=25,
        n_informative=3,
        n_redundant=2,
        n_repeated=2,
        n_classes=3,
        n_clusters_per_class=1,
        shuffle=False,
        random_state=42,
    )

    # SparkSelector works with Spark DataFrames, so convert data to a DataFrame
    # Note: in a real world setting, you probably load data from parquet files or other sources
    spark = SparkSession.builder.getOrCreate()
    sdf = spark.createDataFrame(pd.DataFrame(X).assign(label=y))

    # Split training and validation sets (to keep the example simple, we don't split in a stratified fashion)
    # Note: in a real world setting, you probably want a test set as well
    sdf_train, sdf_val = sdf.randomSplit([0.80, 0.20], seed=42)

We will use :code:`SparkSelector` with a Random Forest classifier and SHAP's TreeExplainer.

.. code:: python

    from shapicant import SparkSelector
    from pyspark.ml.classification import RandomForestClassifier
    import shap

    # Spark's Random Forest (with bootstrap), without column subsampling
    # Note: the "featuresCol" and "labelCol" parameters are ignored here, since they are set by SparkSelector
    model = RandomForestClassifier(
        featureSubsetStrategy="all",
        numTrees=20,
        seed=42,
    )

    # This is the class (not its instance) of SHAP's TreeExplainer
    explainer_type = shap.TreeExplainer

    # Use SparkSelector with 50 iterations
    selector = SparkSelector(model, explainer_type, n_iter=50, random_state=42)

    # Run the feature selection
    # If we provide a validation set, SHAP values are computed on it, otherwise they are computed on the training set
    selector.fit(sdf_train, label_col="label", sdf_validation=sdf_val)

    # Get the DataFrame with the selected features (with a p-value <= 0.10)
    sdf_train_selected = selector.transform(sdf_train, label_col="label", alpha=0.10)
    sdf_val_selected = selector.transform(sdf_val, label_col="label", alpha=0.10)

    # We can also get the p-values as a pandas Series
    p_values = selector.p_values_

----------
References
----------

.. [LUN] Lundberg, S., & Lee, S.I. (2017). A unified approach to interpreting model predictions. In *Advances in Neural Information Processing Systems* (pp. 4765–4774).
.. [ALT] Altmann, A., Toloşi, L., Sander, O., & Lengauer, T. (2010). Permutation importance: a corrected feature importance measure. *Bioinformatics, 26* (10), 1340–1347.
.. [NOR] North, B. V., Curtis, D., & Sham, P. C. (2002). A note on the calculation of empirical P values from Monte Carlo procedures. *American Journal of Human Genetics, 71* (2), 439–441.
.. [BRE] Breiman, L. (2001). Random Forests. *Machine Learning, 45* (1), 5–32.
.. [KUR] Kursa, M., & Rudnicki, W. (2010). Feature Selection with the Boruta Package. *Journal of Statistical Software, 36*, 1–13.
5 changes: 5 additions & 0 deletions setup.cfg
@@ -0,0 +1,5 @@
[metadata]
license_files=LICENSE

[bdist_wheel]
universal=1
41 changes: 41 additions & 0 deletions setup.py
@@ -0,0 +1,41 @@
from io import open
from os import path

from setuptools import find_packages, setup

import shapicant

here = path.abspath(path.dirname(__file__))

# Get the long description from the README file
with open(path.join(here, "README.rst"), encoding="utf-8") as f:
    long_description = f.read()

setup(
    name="shapicant",
    version=shapicant.__version__,
    description="Feature selection package based on SHAP and target permutation, for pandas and Spark",
    long_description=long_description,
    long_description_content_type="text/x-rst",
    url="https://github.com/manuel-calzolari/shapicant",
    download_url="https://github.com/manuel-calzolari/shapicant/releases",
    author="Manuel Calzolari",
    classifiers=[
        "Development Status :: 4 - Beta",
        "Intended Audience :: Science/Research",
        "Intended Audience :: Developers",
        "Topic :: Software Development",
        "Topic :: Scientific/Engineering",
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
        "Programming Language :: Python :: 3.8",
    ],
    packages=find_packages(),
    python_requires=">=3.6",
    install_requires=["shap>=0.36.0", "numpy", "pandas", "scikit-learn", "tqdm"],
    extras_require={
        "spark": ["pyspark>=2.4", "pyarrow"],
    },
)
11 changes: 11 additions & 0 deletions shapicant/__init__.py
@@ -0,0 +1,11 @@
"""
The shapicant module implements a feature selection algorithm based on SHAP and target permutation.
"""
from ._base import BaseSelector
from ._pandas_selector import PandasSelector
from ._spark_selector import SparkSelector

__version__ = "0.1.0"

__all__ = ["BaseSelector", "PandasSelector", "SparkSelector"]
94 changes: 94 additions & 0 deletions shapicant/_base.py
@@ -0,0 +1,94 @@
"""
Base class for all selectors.
"""

from abc import ABCMeta, abstractmethod
from functools import reduce
from typing import List, Optional, Type, Union

from numpy.random import RandomState
from pandas import Series
from shap import Explainer


class BaseSelector(metaclass=ABCMeta):
    """Abstract base class for all selectors in shapicant.

    Args:
        estimator: A supervised learning estimator with a 'fit' method.
        explainer_type: A SHAP explainer type.
        n_iter: The number of iterations to perform.
        verbose: Controls verbosity of output.
        random_state: Parameter to control the random number generator used.

    Attributes:
        p_values_ (Series): Series containing the empirical p-values of the features.

    """

    def __init__(
        self,
        estimator: object,
        explainer_type: Type[Explainer],
        n_iter: int = 100,
        verbose: Union[int, bool] = 1,
        random_state: Optional[Union[int, RandomState]] = None,
    ) -> None:
        self.estimator = estimator
        self.explainer_type = explainer_type
        self.n_iter = n_iter
        self.verbose = verbose
        self.random_state = random_state
        self.p_values_ = None
        self._current_iter = None
        self._n_outputs = None

    @abstractmethod
    def fit(self, *args, **kwargs):
        """
        Abstract 'fit' method.
        """

    @abstractmethod
    def transform(self, *args, **kwargs):
        """
        Abstract 'transform' method.
        """

    @abstractmethod
    def fit_transform(self, *args, **kwargs):
        """
        Abstract 'fit_transform' method.
        """

    def _check_is_fitted(self):
        if self.p_values_ is None:
            raise AttributeError(
                "This instance is not fitted yet. Call 'fit' with appropriate arguments before using this method."
            )

    def _validate_params(self):
        if self.n_iter < 10:
            raise ValueError("n_iter must be greater than or equal to 10.")

    def _compute_p_values(
        self,
        true_pos_shap_values: List[Series],
        null_pos_shap_values: List[Series],
        true_neg_shap_values: List[Series],
        null_neg_shap_values: List[Series],
    ) -> Series:
        pos_results = [None] * self._n_outputs
        neg_results = [None] * self._n_outputs
        results = [None] * self._n_outputs
        for i in range(self._n_outputs):
            # For each output, a null importance is "at least as extreme" as the true one
            # if it is >= the true mean positive SHAP value or <= the true mean negative SHAP value
            pos_results[i] = null_pos_shap_values[i].ge(true_pos_shap_values[i], axis=0)
            neg_results[i] = null_neg_shap_values[i].le(true_neg_shap_values[i], axis=0)
            results[i] = pos_results[i] | neg_results[i]
        # Keep only the iterations that are at least as extreme for every output,
        # then count them for each feature
        results = reduce(lambda df_0, df_1: df_0 & df_1, results).sum(axis=1)
        # Empirical p-values with the (r+1)/(n+1) estimator
        p_values = (results + 1) / (self.n_iter + 1)
        return p_values
