Commit

[MRG + 2] Allow f_regression to accept a sparse matrix with centering (scikit-learn#8065)

* Updated centering for f_regression

Allows f_regression to accept a sparse matrix when center=True.

* Fixed E226 spacing issue.

* Added f_regression sparse update to whats_new.rst
acadiansith authored and paulha committed Aug 19, 2017
1 parent 3f72f4c commit b764e65
Showing 3 changed files with 24 additions and 5 deletions.
3 changes: 3 additions & 0 deletions doc/whats_new.rst
@@ -110,6 +110,9 @@ Enhancements
 kernels which were previously prohibited. :issue:`8005` by `Andreas Müller`_ .


+- Added ability to use sparse matrices in :func:`feature_selection.f_regression`
+with ``center=True``. :issue:`8065` by :user:`Daniel LeJeune <acadiansith>`.
+
 Bug fixes
 .........

6 changes: 6 additions & 0 deletions sklearn/feature_selection/tests/test_feature_select.py
@@ -92,6 +92,12 @@ def test_f_regression():
     assert_true((pv[:5] < 0.05).all())
     assert_true((pv[5:] > 1.e-4).all())

+    # with centering, compare with sparse
+    F, pv = f_regression(X, y, center=True)
+    F_sparse, pv_sparse = f_regression(sparse.csr_matrix(X), y, center=True)
+    assert_array_almost_equal(F_sparse, F)
+    assert_array_almost_equal(pv_sparse, pv)
+
     # again without centering, compare with sparse
     F, pv = f_regression(X, y, center=False)
     F_sparse, pv_sparse = f_regression(sparse.csr_matrix(X), y, center=False)
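
Note on what the new test exercises: the centered result for sparse input should match the dense one even though the sparse matrix is never centered explicitly. The sketch below illustrates that idea with the standard library only. It is not the scikit-learn implementation; the function names and the dict-based column (row index mapped to nonzero value) are hypothetical stand-ins for one column of a `scipy.sparse` matrix.

```python
import math


def centered_corr_sparse_column(col, y):
    """Pearson correlation between one sparse column and a dense target.

    `col` maps row index -> stored nonzero value (hypothetical stand-in for
    a sparse matrix column); all other entries are implicitly zero.
    """
    n = len(y)
    y_mean = sum(y) / n
    y_c = [v - y_mean for v in y]  # y is dense, so centering it is cheap

    # sum_i x_i * (y_i - mean(y)) only touches the stored nonzeros of x
    dot = sum(v * y_c[i] for i, v in col.items())

    # ||x - mean(x)||^2 = sum(x^2) - n * mean(x)^2, again nonzeros only
    x_mean = sum(col.values()) / n
    x_norm = math.sqrt(sum(v * v for v in col.values()) - n * x_mean ** 2)
    y_norm = math.sqrt(sum(v * v for v in y_c))
    return dot / (x_norm * y_norm)


def centered_corr_dense(x, y):
    """Reference: center both vectors explicitly, then correlate."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    num = sum(a * b for a, b in zip(xc, yc))
    den = math.sqrt(sum(a * a for a in xc)) * math.sqrt(sum(b * b for b in yc))
    return num / den


# Made-up illustrative data: a mostly-zero feature and a dense target.
x_dense = [0.0, 3.0, 0.0, 0.0, 1.5, 0.0]
x_sparse = {i: v for i, v in enumerate(x_dense) if v != 0.0}
y = [1.0, 4.0, 0.5, 1.2, 2.0, 0.8]

r_sparse = centered_corr_sparse_column(x_sparse, y)
r_dense = centered_corr_dense(x_dense, y)
print(abs(r_sparse - r_dense) < 1e-12)
```

The sparse variant never materialises `x - mean(x)`, which would turn every implicit zero into a stored value and make the column dense.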
20 changes: 15 additions & 5 deletions sklearn/feature_selection/univariate_selection.py
@@ -266,17 +266,27 @@ def f_regression(X, y, center=True):
     f_classif: ANOVA F-value between label/feature for classification tasks.
     chi2: Chi-squared stats of non-negative features for classification tasks.
     """
-    if issparse(X) and center:
-        raise ValueError("center=True only allowed for dense data")
     X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=np.float64)
+    n_samples = X.shape[0]
+
+    # compute centered values
+    # note that E[(x - mean(x))*(y - mean(y))] = E[x*(y - mean(y))], so we
+    # need not center X
     if center:
         y = y - np.mean(y)
-        X = X.copy('F')  # faster in fortran
-        X -= X.mean(axis=0)
+        if issparse(X):
+            X_means = X.mean(axis=0).getA1()
+        else:
+            X_means = X.mean(axis=0)
+        # compute the scaled standard deviations via moments
+        X_norms = np.sqrt(row_norms(X.T, squared=True) -
+                          n_samples * X_means ** 2)
+    else:
+        X_norms = row_norms(X.T)

     # compute the correlation
     corr = safe_sparse_dot(y, X)
-    corr /= row_norms(X.T)
+    corr /= X_norms
     corr /= norm(y)

     # convert to p-value
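
Note on the two identities the change relies on (the comment about not needing to center X, and the norms computed "via moments"). The derivation below is a sketch for a single feature column; the symbols $x_i$, $y_i$, $\bar{x}$, $\bar{y}$ and $n$ (for `n_samples`) are notation introduced here, not taken from the commit.

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
  &= \sum_{i=1}^{n} x_i (y_i - \bar{y}) - \bar{x} \sum_{i=1}^{n} (y_i - \bar{y}) \\
  &= \sum_{i=1}^{n} x_i (y_i - \bar{y})
  && \text{since } \sum_{i=1}^{n} (y_i - \bar{y}) = 0, \\[6pt]
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
  && \text{since } \sum_{i=1}^{n} x_i = n\bar{x}.
\end{align*}
```

The first line corresponds to `safe_sparse_dot(y, X)` with only `y` centered; the second corresponds to `row_norms(X.T, squared=True) - n_samples * X_means ** 2`. One general caveat with the second form: subtracting two large, nearly equal quantities can lose precision when a column's mean is large relative to its spread.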