
Feature correlation p-values and correction methods #10

Closed
feribg opened this issue Oct 13, 2018 · 4 comments
Labels
enhancement New feature or request

@feribg
Contributor

feribg commented Oct 13, 2018

Wanted to get the conversation open on feature correlation. Right now it just does a naive spearmanr, with no insight into the resulting p-values. It would be great to do a few things, listed below in order of importance:

  1. Introduce p-values and maybe apply the appropriate cutoffs
  2. Introduce permutation-based correlation, starting with lagged correlations, for example (context is time series analysis)
  3. Introduce a probability correction method for 1 and/or 2, such as Bonferroni, to account for the number of correlation estimates we're doing between features and between the number of lags if we end up implementing #2 (Added ability to directly supply fig size to feature importance plots).

Happy to get the conversation going and see where we end up. Right now the feature correlation estimation is not quite stable in the context of very noisy time series data.
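Points 1 and 3 above can be sketched together. This is a minimal illustration, not the package's API: the function name, the `alpha` default, and the choice of a plain Bonferroni divisor over the number of distinct feature pairs are all my assumptions.

```python
# Illustrative sketch only: pairwise Spearman correlations with p-values
# and a Bonferroni-corrected significance mask over all tested pairs.
import numpy as np
from scipy.stats import spearmanr

def spearman_with_bonferroni(X, alpha=0.05):
    """X: 2-D array of shape (n_samples, n_features), n_features >= 3
    (scipy returns scalars, not matrices, for exactly two columns).
    Returns (rho, pvals, significant) as (n_features, n_features) arrays."""
    n_features = X.shape[1]
    rho, pvals = spearmanr(X)  # full correlation and p-value matrices
    # Bonferroni: divide alpha by the number of distinct pairs tested
    n_tests = n_features * (n_features - 1) // 2
    significant = pvals < (alpha / n_tests)
    return rho, pvals, significant
```

A permutation-based variant (point 2) would replace the analytic p-values with ones estimated by shuffling one series relative to the other, which also extends naturally to lagged correlations.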

@parrt parrt added the enhancement New feature or request label Oct 14, 2018
@parrt
Owner

parrt commented Oct 14, 2018

OOOPS feature_dependence_matrix() is already in the package. haha. Ignore this response other than to compare to what you are considering. :)

Another thing to add would be predicting feature i using all features j≠i, but I'm not sure what you call that. Multicollinearity perhaps? I've noticed that many seemingly totally unrelated features are collinear in some way even when their Spearman coefficient is low. Dropping such a seemingly important feature then has no effect on the overall model accuracy because some of the other features are picking up the slack.

I could swear I did some tests recently but I can't find the code. Oh right. It's a simple matter of dropping the original target and a feature of interest. For example, if we are predicting rent price but would like to know the strength of the relationship between predictor doorman and the other features, we can train a model using that as the target:

from sklearn.ensemble import RandomForestRegressor

# Drop the original target (price) and use the feature of interest as the target
X_train, y_train = df_aug.drop(['price','doorman'], axis=1), df_aug['doorman']

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
rf.fit(X_train, y_train)
print(rf.oob_score_)  # OOB R^2: how well the other features predict doorman

With a high R^2 like .8 or .9, we can predict that feature pretty well from the other features. I recall that doorman was a fairly important feature and I was wondering why dropping it did not affect the model accuracy very much. Well, obviously the other features are covering for it but I think the correlation coefficients were low against the other features.

import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.ensemble import RandomForestRegressor

def feature_dependence_matrix(X_train):
    """
    Given training observation independent variables in X_train (a dataframe),
    compute the feature importance using each var as a dependent variable.
    We retrain a random forest for each var as target, using the others as
    independent vars. Only numeric columns are considered.

    :return: a non-symmetric data frame where each row gives, for the row's var
             used as a model target, the importance of every other var plus an
             overall Dependence score (the model's OOB R^2).
    """
    numcols = [col for col in X_train if is_numeric_dtype(X_train[col])]

    # index and columns over numcols only, so positions line up with np.insert below
    df_dep = pd.DataFrame(index=numcols, columns=['Dependence']+numcols)
    for i, col in enumerate(numcols):
        X, y = X_train[numcols].drop(col, axis=1), X_train[col]
        rf = RandomForestRegressor(n_estimators=50, n_jobs=-1, oob_score=True)
        rf.fit(X, y)
        # permutation importance of each remaining predictor for this target
        imp = permutation_importances_raw(rf, X, y, oob_regression_r2_score)
        imp = np.insert(imp, i, 1.0)                       # a var predicts itself perfectly
        df_dep.iloc[i] = np.insert(imp, 0, rf.oob_score_)  # add overall dependence

    return df_dep

I get feature dependence like this:

[screenshot, 2018-10-13: feature dependence matrix output]

The dependencies are now clear. doorman is predictable using other features. Some make sense, like the GPS stuff, but doorman threw me a bit.

@feribg
Contributor Author

feribg commented Oct 15, 2018

@parrt I had a look at this function and it's a good starting point to use as a base, but I have several questions/comments around it, because it seems you've modified the code above and it's different from what's on master now (which seems broken to me):

  1. It assumes a random forest classifier (presumably the same one that we use for the final model). The problem is that a lot of the variables used as features are continuous, so when converting them to targets the whole thing falls apart. (You substituted a regressor above, but I guess we should provide a flexible implementation as a param.)
  2. These couple of lines are a bit ambiguous to me:
imp = permutation_importances_raw(rf, X, y, oob_regression_r2_score, n_samples)
imp = np.insert(imp, i, 1.0)
df_dep.iloc[i] = np.insert(imp, 0, rf.oob_score_) # add overall dependence

Why are we using oob_regression_r2_score, and what's its meaning? It seems like we want classification but are running R^2 as a scoring function, which is confusing to me. Would it always be regression, even if we have a mix of continuous and categorical?
  3. To the above, should we not add support for a val_set, if oob_score_ is not reliable or not present at all?

Happy to discuss and clean up that function in master first, then build on top of it.
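The "flexible implementation" in point 1 could look roughly like this: pick a regressor or a classifier per target column based on its dtype. A hedged sketch only; the function name, the cardinality threshold, and the heuristic itself are my assumptions, not anything in the package.

```python
# Illustrative sketch: choose an unfit random forest suited to a target column.
# Non-numeric targets, or numeric ones with few distinct values, are treated
# as categorical; everything else gets a regressor.
from pandas.api.types import is_numeric_dtype
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def estimator_for_target(y, max_classes=20):
    """y: a pandas Series that will serve as the model target.
    max_classes: cardinality cutoff below which numeric y is treated as categorical."""
    looks_categorical = (not is_numeric_dtype(y)) or y.nunique() <= max_classes
    if looks_categorical:
        return RandomForestClassifier(n_estimators=50, n_jobs=-1, oob_score=True)
    return RandomForestRegressor(n_estimators=50, n_jobs=-1, oob_score=True)
```

Inside a feature_dependence_matrix-style loop, this would replace the hard-coded RandomForestRegressor, with the scoring function (R^2 vs. accuracy) chosen to match.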

@parrt
Owner

parrt commented Nov 25, 2020

Hi. I don't think I'm going to be pushing too many new features into this package so I will close this to clean up.

@parrt parrt closed this as completed Nov 25, 2020
@feribg
Contributor Author

feribg commented Dec 2, 2020

@parrt fair. This is, I think, a more full-fledged library that uses the same method at its core, with some extra statistical tweaks:
https://github.com/scikit-learn-contrib/boruta_py
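The core idea Boruta adds on top of the approach above is a noise baseline: shuffled "shadow" copies of each column are appended to X, a forest is fit on both, and a real feature counts only if its importance beats the best shadow's. A single-pass sketch of that idea (function name is mine; boruta_py iterates this and applies a proper statistical test on top):

```python
# Illustrative single-pass sketch of Boruta's shadow-feature comparison.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def shadow_feature_mask(X, y, random_state=0):
    """X: (n_samples, n_features) array, y: (n_samples,) continuous target.
    Returns a boolean mask of features whose importance exceeds the
    maximum importance among the shuffled shadow copies."""
    rng = np.random.default_rng(random_state)
    shadows = rng.permuted(X, axis=0)  # shuffle each column independently
    rf = RandomForestRegressor(n_estimators=100, random_state=random_state)
    rf.fit(np.hstack([X, shadows]), y)
    n = X.shape[1]
    real_imp = rf.feature_importances_[:n]
    shadow_imp = rf.feature_importances_[n:]
    return real_imp > shadow_imp.max()
```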
