
Feature correlation p-values and correction methods #10

Closed
feribg opened this issue Oct 13, 2018 · 4 comments
Labels
enhancement New feature or request

@feribg
Contributor

feribg commented Oct 13, 2018

Wanted to get the conversation open on feature correlation. Right now it just does a naive spearmanr, with no insight into the resulting p-values. It would be great to do a few things, listed below in order of importance:

  1. Introduce p-values and maybe apply the appropriate cutoffs
  2. Introduce permutation-based correlation, starting with lagged correlations, for example (context is time series analysis)
  3. Introduce a probability correction method for 1 and/or 2, such as Bonferroni, to account for the number of correlation estimates we're doing between features and between the number of lags if we end up implementing #2 (Added ability to directly supply fig size to feature importance plots).

Happy to get the conversation going and see where we end up. Right now the feature correlation estimation is not quite stable in the context of very noisy time series data.
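Points 1 and 3 above can be sketched together. This is a minimal illustration, not the package's API: the function name, the `alpha` default, and the choice of a plain Bonferroni divisor over the number of distinct feature pairs are all my assumptions.

```python
# Illustrative sketch only: pairwise Spearman correlations with p-values
# and a Bonferroni-corrected significance mask over all tested pairs.
import numpy as np
from scipy.stats import spearmanr

def spearman_with_bonferroni(X, alpha=0.05):
    """X: 2-D array of shape (n_samples, n_features), n_features >= 3
    (scipy returns scalars, not matrices, for exactly two columns).
    Returns (rho, pvals, significant) as (n_features, n_features) arrays."""
    n_features = X.shape[1]
    rho, pvals = spearmanr(X)  # full correlation and p-value matrices
    # Bonferroni: divide alpha by the number of distinct pairs tested
    n_tests = n_features * (n_features - 1) // 2
    significant = pvals < (alpha / n_tests)
    return rho, pvals, significant
```

A permutation-based variant (point 2) would replace the analytic p-values with ones estimated by shuffling one series relative to the other, which also extends naturally to lagged correlations.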

@parrt parrt added the enhancement New feature or request label Oct 14, 2018
@parrt
Owner

parrt commented Oct 14, 2018

OOOPS feature_dependence_matrix() is already in the package. haha. Ignore this response other than to compare to what you are considering. :)

Another thing to add would be predicting feature i using all features j≠i, but I'm not sure what you call that. Multicollinearity perhaps? I've noticed that many seemingly totally unrelated features are collinear in some way even when their Spearman coefficient is low. Dropping such a seemingly important feature then has no effect on the overall model accuracy because some of the other features are picking up the slack.

I could swear I did some tests recently but I can't find the code. Oh right. It's a simple matter of dropping the original target and a feature of interest. For example, if we are predicting rent price but would like to know the strength of the relationship between predictor doorman and the other features, we can train a model using that as the target:

from sklearn.ensemble import RandomForestRegressor

# Drop the original target (price) and use the feature of interest as the target
X_train, y_train = df_aug.drop(['price','doorman'], axis=1), df_aug['doorman']

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
rf.fit(X_train, y_train)
print(rf.oob_score_)  # OOB R^2: how well the other features predict doorman

With a high R^2 like .8 or .9, we can predict that feature pretty well from the other features. I recall that doorman was a fairly important feature and I was wondering why dropping it did not affect the model accuracy very much. Well, obviously the other features are covering for it but I think the correlation coefficients were low against the other features.

import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.ensemble import RandomForestRegressor

def feature_dependence_matrix(X_train):
    """
    Given training observation independent variables in X_train (a dataframe),
    compute the feature importance using each var as a dependent variable.
    We retrain a random forest for each var as target, using the others as
    independent vars. Only numeric columns are considered.

    :return: a non-symmetric data frame where each row gives, for the row's var
             used as a model target, the importance of every other var plus an
             overall Dependence score (the model's OOB R^2).
    """
    numcols = [col for col in X_train if is_numeric_dtype(X_train[col])]

    # index and columns over numcols only, so positions line up with np.insert below
    df_dep = pd.DataFrame(index=numcols, columns=['Dependence']+numcols)
    for i, col in enumerate(numcols):
        X, y = X_train[numcols].drop(col, axis=1), X_train[col]
        rf = RandomForestRegressor(n_estimators=50, n_jobs=-1, oob_score=True)
        rf.fit(X, y)
        # permutation importance of each remaining predictor for this target
        imp = permutation_importances_raw(rf, X, y, oob_regression_r2_score)
        imp = np.insert(imp, i, 1.0)                       # a var predicts itself perfectly
        df_dep.iloc[i] = np.insert(imp, 0, rf.oob_score_)  # add overall dependence

    return df_dep

I get feature dependence like this:

[screenshot, 2018-10-13: feature dependence matrix output]

The dependencies are now clear. doorman is predictable using other features. Some make sense, like the GPS stuff, but doorman threw me a bit.

@feribg
Contributor Author

feribg commented Oct 15, 2018

@parrt I had a look at this function and it's a good starting point to use as a base, but I have several questions/comments around it, because it seems you've modified the code above and it's different from what's on master now (which seems broken to me):

  1. It assumes a random forest classifier (presumably the same one that we use for the final model). The problem is that a lot of the variables used as features are continuous, so when converting them to targets the whole thing falls apart. (You substituted a regressor above, but I guess we should provide a flexible implementation as a param.)
  2. These couple of lines are a bit ambiguous to me:
imp = permutation_importances_raw(rf, X, y, oob_regression_r2_score, n_samples)
imp = np.insert(imp, i, 1.0)
df_dep.iloc[i] = np.insert(imp, 0, rf.oob_score_) # add overall dependence

Why are we using oob_regression_r2_score, and what's its meaning? It seems like we want classification but are running R^2 as a scoring function, which is confusing to me. Would it always be regression, even if we have a mix of continuous and categorical?
  3. To the above, should we not add support for a val_set, if oob_score_ is not reliable or not present at all?

Happy to discuss and clean up that function in master first, then build on top of it.
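The "flexible implementation" in point 1 could look roughly like this: pick a regressor or a classifier per target column based on its dtype. A hedged sketch only; the function name, the cardinality threshold, and the heuristic itself are my assumptions, not anything in the package.

```python
# Illustrative sketch: choose an unfit random forest suited to a target column.
# Non-numeric targets, or numeric ones with few distinct values, are treated
# as categorical; everything else gets a regressor.
from pandas.api.types import is_numeric_dtype
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def estimator_for_target(y, max_classes=20):
    """y: a pandas Series that will serve as the model target.
    max_classes: cardinality cutoff below which numeric y is treated as categorical."""
    looks_categorical = (not is_numeric_dtype(y)) or y.nunique() <= max_classes
    if looks_categorical:
        return RandomForestClassifier(n_estimators=50, n_jobs=-1, oob_score=True)
    return RandomForestRegressor(n_estimators=50, n_jobs=-1, oob_score=True)
```

Inside a feature_dependence_matrix-style loop, this would replace the hard-coded RandomForestRegressor, with the scoring function (R^2 vs. accuracy) chosen to match.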

@parrt
Owner

parrt commented Nov 25, 2020

Hi. I don't think I'm going to be pushing too many new features into this package so I will close this to clean up.

@parrt parrt closed this as completed Nov 25, 2020
@feribg
Contributor Author

feribg commented Dec 2, 2020

@parrt fair. This is, I think, a more full-fledged library that uses the same method at its core, with some extra statistical tweaks:
https://github.com/scikit-learn-contrib/boruta_py
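The core idea Boruta adds on top of the approach above is a noise baseline: shuffled "shadow" copies of each column are appended to X, a forest is fit on both, and a real feature counts only if its importance beats the best shadow's. A single-pass sketch of that idea (function name is mine; boruta_py iterates this and applies a proper statistical test on top):

```python
# Illustrative single-pass sketch of Boruta's shadow-feature comparison.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def shadow_feature_mask(X, y, random_state=0):
    """X: (n_samples, n_features) array, y: (n_samples,) continuous target.
    Returns a boolean mask of features whose importance exceeds the
    maximum importance among the shuffled shadow copies."""
    rng = np.random.default_rng(random_state)
    shadows = rng.permuted(X, axis=0)  # shuffle each column independently
    rf = RandomForestRegressor(n_estimators=100, random_state=random_state)
    rf.fit(np.hstack([X, shadows]), y)
    n = X.shape[1]
    real_imp = rf.feature_importances_[:n]
    shadow_imp = rf.feature_importances_[n:]
    return real_imp > shadow_imp.max()
```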
