Feature correlation p-values and correction methods #10
Oops! Another thing to add would be predicting feature i using all features j≠i, but I'm not sure what you call that. Multicollinearity, perhaps? I've noticed that many seemingly unrelated features are collinear in some way even when their Spearman coefficient is low. Dropping such a seemingly important feature then has no effect on overall model accuracy, because some of the other features pick up the slack.

I could swear I did some tests recently but I can't find the code. Oh right, it's a simple matter of dropping the original target and the feature of interest. For example, if we are predicting rent price but would like to know how strongly the other predictors relate to `doorman`:

```python
from sklearn.ensemble import RandomForestRegressor

X_train, y_train = df_aug.drop(['price','doorman'], axis=1), df_aug['doorman']
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
rf.fit(X_train, y_train)
print(rf.oob_score_)
```

With a high R^2 like .8 or .9, we can predict that feature pretty well from the other features. Here is a function that repeats this for every numeric column:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.ensemble import RandomForestRegressor

def feature_dependence_matrix(X_train):
    """
    Given training observation independent variables in X_train (a dataframe),
    compute the feature importance using each var as a dependent variable.
    We retrain a random forest for each var as target, using the others as
    independent vars. Only numeric columns are considered.

    :return: a non-symmetric dataframe with the dependence matrix, where each
             row gives the importance of each var to the var used as that
             row's model target.
    """
    numcols = [col for col in X_train if is_numeric_dtype(X_train[col])]
    df_dep = pd.DataFrame(index=X_train.columns,
                          columns=['Dependence'] + X_train.columns.tolist())
    for i in range(len(numcols)):
        col = numcols[i]
        X, y = X_train.drop(col, axis=1), X_train[col]
        rf = RandomForestRegressor(n_estimators=50, n_jobs=-1, oob_score=True)
        rf.fit(X, y)
        # permutation importance rather than rf.feature_importances_
        imp = permutation_importances_raw(rf, X, y, oob_regression_r2_score)
        imp = np.insert(imp, i, 1.0)  # a var's importance to itself is 1.0
        df_dep.iloc[i] = np.insert(imp, 0, rf.oob_score_)  # add overall dependence
    return df_dep
```

(`permutation_importances_raw` and `oob_regression_r2_score` come from this package's importance code.) I get feature dependence like this:

[feature dependence matrix output]

The dependencies are now clear.
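To make the intended use concrete, here is a minimal usage sketch; `df_aug` and the 0.8 cutoff are illustrative assumptions, not part of the function:

```python
# Illustrative sketch: df_aug and the 0.8 cutoff are assumptions for this example.
dep = feature_dependence_matrix(df_aug.drop('price', axis=1))

# 'Dependence' is each feature's OOB R^2 when predicted from all other
# features; values near 1.0 suggest the feature is largely redundant.
redundant = dep[dep['Dependence'].astype(float) > 0.8].index.tolist()
print("Highly predictable (possibly redundant) features:", redundant)
```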
@parrt I had a look at this function and it's a good starting point to use as a base, but I have several questions/comments around it, because it seems you've modified the code above and it's different from what's on master now (which looks broken to me):

Why are we using `oob_regression_r2` and what is its meaning? It seems like we want classification but are running R^2 as the scoring function, which is confusing to me. Would it always be regression, even if we have a mix of continuous and categorical features?
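For what it's worth, one way the mixed case could work is to branch on the target's dtype; this is just a sketch of the idea, with a hypothetical helper name, not what's on master:

```python
from pandas.api.types import is_numeric_dtype
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def fit_dependence_model(X, y):
    """Sketch: choose a regressor (OOB R^2) or a classifier (OOB accuracy)
    based on the target column's dtype. Hypothetical helper, not in master."""
    if is_numeric_dtype(y):
        rf = RandomForestRegressor(n_estimators=50, n_jobs=-1, oob_score=True)
    else:
        # sklearn classifiers accept string labels for y, but X must be numeric
        rf = RandomForestClassifier(n_estimators=50, n_jobs=-1, oob_score=True)
    rf.fit(X, y)
    return rf, rf.oob_score_  # R^2 for regression, accuracy for classification
```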
Hi. I don't think I'm going to be pushing too many new features into this package, so I will close this to clean up.
@parrt fair. I think this is a more full-fledged library that uses the same method at its core, with some extra statistical tweaks:
Wanted to open a conversation on feature correlation. Right now it just does a naive spearmanr, with no insight into the resulting p-values. It would be great to do a few things, listed below in order of importance:

Happy to get the conversation going and see where we end up. Right now the feature correlation estimation is not quite stable in the context of very noisy time-series data.
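As a starting point for the p-value piece, here is a minimal sketch of pairwise Spearman p-values with a Benjamini-Hochberg (FDR) correction, using scipy and statsmodels; `df` is a stand-in for the feature dataframe:

```python
import pandas as pd
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

def spearman_pvalue_table(df, alpha=0.05):
    """Sketch: pairwise Spearman rho and p-values, BH-corrected across pairs."""
    cols = list(df.columns)
    rows, pvals = [], []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            rho, p = spearmanr(df[cols[i]], df[cols[j]])
            rows.append((cols[i], cols[j], rho))
            pvals.append(p)
    # correct for the number of pairs tested, controlling false discovery rate
    reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method='fdr_bh')
    return pd.DataFrame(
        [(a, b, rho, p, pa, sig) for (a, b, rho), p, pa, sig
         in zip(rows, pvals, p_adj, reject)],
        columns=['feature_a', 'feature_b', 'rho', 'p', 'p_adj', 'significant'])
```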