
Issues with current ML validation score #40

Open
GinoWoz1 opened this issue Nov 26, 2018 · 3 comments

Comments


GinoWoz1 commented Nov 26, 2018

Hello,

Thanks for the help so far. I was able to get the tool up and running on Windows.

However, I am observing two odd things:

  1. When I use Gradient Boosting Regressor, my score gets worse with each generation, even when I switch the scoring function's sign. The first score is nearly the best score I have gotten on my own (no feature engineering done on the data set).

https://github.com/GinoWoz1/AdvancedHousePrices/blob/master/FEW_GB.ipynb

  2. When I use Random Forest with the same scorer, the current ML validation score returns 0 and the run finishes very fast.

https://github.com/GinoWoz1/AdvancedHousePrices/blob/master/FEW_RF.ipynb

I think I am missing something about how to use this tool, but I have no idea what. I am trying to use it in tandem with TPOT, as I am exploring GA/GP-based feature-creation tools. I sincerely appreciate any advice or guidance you can provide.
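One thing worth checking before comparing scores across generations: sklearn scorers are always maximized, so error metrics are returned negated, and a "worsening" score can just be a more negative number. A minimal sketch (synthetic data standing in for the housing set) that establishes a baseline CV score for the gradient boosting model, so a feature-learning run has a reference point:

```python
# Hypothetical baseline: cross-validate GradientBoostingRegressor on its own,
# before any learned features, using sklearn's negated-error scorer convention.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 'neg_mean_squared_error' is maximized by sklearn, so values are <= 0;
# negate the mean to recover the usual (positive) MSE.
scores = cross_val_score(GradientBoostingRegressor(random_state=0), X, y,
                         cv=5, scoring='neg_mean_squared_error')
print('baseline CV MSE: {:.3f}'.format(-scores.mean()))
```

If the learned features cannot beat this baseline, the validation score will stagnate or degrade regardless of the scorer's sign.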

Sincerely,
G


GinoWoz1 commented Dec 5, 2018

Hello @lacava

Sorry to bother you, but have you had a chance to look at this? I have been experimenting with TPOT for the last four months and have talked to Randy Olson a few times; he referred me to Few, and I am hoping to run a few tests with Few and TPOT over the winter. My name is Justin Joyce, and I am currently exploring multiple genetic algorithm and genetic programming methods as a master's student.

Sincerely,
Justin


lacava commented Dec 5, 2018

Hi Justin, I did look at it and ran it a couple of times. It looks like there is a small bug in Few: it prints that the current ML validation score is 0 when it is not, as shown by the internal CV score that is also printed.

Otherwise, this just seems to be a dataset that is not amenable to feature learning. I have found that, when Few is paired with gradient boosting or other high-capacity methods, it is quite difficult to find a transformation of the data that improves the underlying ML. Using Lasso, I was occasionally able to find a reduced feature space, but not one that dramatically improved the score.
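The Lasso observation above can be reproduced outside Few with plain sklearn: an L1-penalized fit zeroes out some coefficients, yielding a reduced feature space without necessarily improving the downstream score. A hedged sketch on synthetic data:

```python
# L1 regularization as feature selection: SelectFromModel keeps only features
# whose Lasso coefficients are (effectively) nonzero.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
mask = selector.get_support()
print('kept {} of {} features'.format(mask.sum(), X.shape[1]))
```

A smaller feature space is useful on its own (faster, more interpretable) even when the validation score barely moves.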

You also may be interested in trying Feat, which is a more powerful version of Few that I have been working on for the last year. It has a similar sklearn interface, uses a GA to drive search, and includes neural network activation functions and backprop for learning weights. Here's the result of running that:

from feat import Feat
from sklearn.metrics import r2_score

learner = Feat(gens=1000, max_stall=100, pop_size=100, backprop=True,
               verbosity=2,
               max_dim=50,
               feature_names=','.join(X_train.columns))
X = X_train.values
y = y_train
learner.fit(X, y)

print('final score: {}'.format(r2_score(y_train, learner.predict(X))))
print('model:\n', learner.get_model())

final score: 0.9004700679745707
model:
Feature Weight
relu(2ndFlrSF) 4575965.203086
(2ndFlrSF^2) -2993884.266219
(2ndFlrSF^3) 2704420.880597
2ndFlrSF -2653911.531777
relu(2ndFlrSF) -2328605.023701
(GrLivArea^3) -972888.992427
LotArea 829523.137047
YearBuilt 741179.594699
relu(GrLivArea) 689061.927277
relu(OverallQual) 630256.458467
relu(LotArea) -593285.483499
TotalBsmtSF 536721.385827
float(OverallCond) 408904.496879
sqrt(|YearBuilt|) 341173.116787
1stFlrSF 340342.422143
float(GarageCars) 286630.673027
OverallQual 230831.115685
(OverallQual*GarageArea) 214190.124423
BsmtFinSF1 192567.902371
(TotalBsmtSF+BsmtUnfSF) -192281.015785
relu(OverallQual) 189096.541817
float(Fireplaces) 183822.299462
GrLivArea 155897.078241
float(Condition2_Norm) 116659.048371
ScreenPorch 102781.956623
float(Neighborhood_OldTown) -100922.111480
float(HalfBath) 97458.382917

The downside is that you can't specify your own scoring_function at the moment.


lacava commented Dec 5, 2018

When I use Gradient Boosting Regressor, my score gets worse with each generation, even when I switch the scoring function's sign. The first score is nearly the best score I have gotten on my own (no feature engineering done on the data set).

This I did not observe. I did observe that Few did not find better features, but the internal CV score stayed constant, as it should.
