
[BUG] CascadeForestRegressor somehow cannot be inserted into a DataFrame #87

Closed
IncubatorShokuhou opened this issue Jul 13, 2021 · 10 comments · Fixed by #88
Labels
needtriage Further information is requested

Comments

@IncubatorShokuhou
Contributor

IncubatorShokuhou commented Jul 13, 2021

Describe the bug
CascadeForestRegressor somehow cannot be inserted into a DataFrame

To Reproduce

import pandas as pd
from deepforest import CascadeForestRegressor
from ngboost import NGBRegressor

ngr = NGBRegressor()  # ngboost regressor as an example; xgb and lgb behave the same
cforest = CascadeForestRegressor()
df = pd.DataFrame()

# works as expected
df.insert(0, "ngr", [ngr])
# somehow raises ValueError
df.insert(0, "cf", [cforest])

Expected behavior
No error

Additional context

ValueError                                Traceback (most recent call last)
<ipython-input-32-ab0139d10254> in <module>
----> 1 df.insert(0, "cf", [cforest])

/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/pandas/core/frame.py in insert(self, loc, column, value, allow_duplicates)
   3760             )
   3761         self._ensure_valid_index(value)
-> 3762         value = self._sanitize_column(column, value, broadcast=False)
   3763         self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
   3764 

/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
   3900             if not isinstance(value, (np.ndarray, Index)):
   3901                 if isinstance(value, list) and len(value) > 0:
-> 3902                     value = maybe_convert_platform(value)
   3903                 else:
   3904                     value = com.asarray_tuplesafe(value)

/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in maybe_convert_platform(values)
    110     """ try to do platform conversion, allow ndarray or list here """
    111     if isinstance(values, (list, tuple, range)):
--> 112         values = construct_1d_object_array_from_listlike(values)
    113     if getattr(values, "dtype", None) == np.object_:
    114         if hasattr(values, "_values"):

/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in construct_1d_object_array_from_listlike(values)
   1636     # making a 1D array that contains list-likes is a bit tricky:
   1637     result = np.empty(len(values), dtype="object")
-> 1638     result[:] = values
   1639     return result
   1640 

/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/deepforest/cascade.py in __getitem__(self, index)
    518 
    519     def __getitem__(self, index):
--> 520         return self._get_layer(index)
    521 
    522     def _get_n_output(self, y):

/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/deepforest/cascade.py in _get_layer(self, layer_idx)
    561             logger.debug("self.n_layers_ = "+ str(self.n_layers_))
    562             logger.debug("layer_idx = "+ str(layer_idx))
--> 563             raise ValueError(msg.format(self.n_layers_ - 1, layer_idx))
    564 
    565         layer_key = "layer_{}".format(layer_idx)

ValueError: The layer index should be in the range [0, 1], but got 2 instead.

This bug can simply be fixed by changing if not 0 <= layer_idx < self.n_layers_: to if not 0 <= layer_idx <= self.n_layers_:, but I still don't know the cause of this error or whether this fix is correct.

@xuyxu
Member

xuyxu commented Jul 13, 2021

Hi @IncubatorShokuhou, may I ask what the purpose of storing the model in a pandas DataFrame is?

@IncubatorShokuhou
Contributor Author

@xuyxu Actually I am trying to integrate deep-forest into PyCaret. In theory, PyCaret supports all ML algorithms with a scikit-learn-compatible API. In practice, most models, including xgboost, lightgbm, catboost, ngboost, explainable boosting machine, et al., can be easily integrated.

Here is the example code:

from pycaret.datasets import get_data
boston = get_data('boston')
from pycaret.regression import *
from deepforest import CascadeForestRegressor
from ngboost import NGBRegressor

# setup, data preprocessing
exp_name = setup(data=boston, target='medv', silent=True)

# establish regressors
ngr = NGBRegressor()
ngboost = create_model(ngr)

cfr = CascadeForestRegressor()
casforest = create_model(cfr)

# compare models
best_model = compare_models(include=[ngboost, casforest, "xgboost", "lightgbm"])

# save model
save_model(best_model, 'best_model')

During the integration, I ran into two errors: 1. Deep-Forest only accepts np.array and cannot take a pd.DataFrame as input, which could easily be fixed by #86 . 2. In line 2219 of https://github.com/pycaret/pycaret/blob/c76f4b7699474bd16a2e2a6d0f52759ae29898b6/pycaret/internal/tabular.py#L2219 , the model object is put into a pd.DataFrame, and the bug described above happens, which is quite weird to me.

I guess there might be something wrong with the initialization. I hope you can give me some suggestions.

@xuyxu
Member

xuyxu commented Jul 13, 2021

Thanks for your kind explanations! I will take a look at your PR first ;-)

@xuyxu xuyxu added the needtriage Further information is requested label Jul 13, 2021
@IncubatorShokuhou
Contributor Author

IncubatorShokuhou commented Jul 13, 2021

BTW, could you please tell me why a local implementation of RandomForestClassifier is used instead of sklearn.ensemble.RandomForestClassifier in line 50 of https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#L50 ? And in https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#L91 , is lgb = __import__("lightgbm.sklearn") simply equivalent to import lightgbm.sklearn as lgb ?

@xuyxu
Member

xuyxu commented Jul 13, 2021

why a local implementation of RandomForestClassifier instead of sklearn.ensemble.RandomForestClassifier is used

sklearn.ensemble.RandomForestClassifier is too slow when fitted on large datasets with millions of samples

lgb = __import__("lightgbm.sklearn")

We prefer to treat lightgbm as a soft dependency. If we used import lightgbm.sklearn as lgb at the top of the module, the program would raise an ImportError whenever lightgbm is not installed, which is not what we want.
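The deferred-import idea can be sketched as follows (a minimal illustration with a hypothetical soft_import helper, not the library's actual code). Note that __import__("a.b") imports a.b but returns the top-level package a, so it is not literally equivalent to import a.b as lgb:

```python
def soft_import(name):
    """Import `name` only when this function is called.

    Returns the top-level package, or None if the module is missing, so
    merely importing the module that defines this helper never raises.
    """
    try:
        # __import__("lightgbm.sklearn") imports the submodule but returns
        # the top-level `lightgbm` package; submodules are then reached by
        # attribute access afterwards (e.g. pkg.sklearn).
        return __import__(name)
    except ModuleNotFoundError:
        return None


lgb = soft_import("lightgbm.sklearn")
if lgb is None:
    # A hard `import lightgbm.sklearn as lgb` at module top level would
    # have raised ImportError at import time instead.
    print("lightgbm not installed; skipping LightGBM-based estimators")
```

The trade-off is that callers must check for None (or raise a friendly error) before using the optional estimator.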

@IncubatorShokuhou
Contributor Author


I see. So maybe I could write a simple GPU version of the three models using cuML.ensemble.RandomForest and gpu_hist?

@xuyxu
Member

xuyxu commented Jul 13, 2021

The performance would be much worse since Random Forest in cuML is not designed for the case where we want the forest to be as complex as possible (it does not support unlimited tree depth).

@IncubatorShokuhou
Contributor Author

IncubatorShokuhou commented Jul 13, 2021

The performance would be much worse since Random Forest in cuML is not designed for the case where we want the forest to be as complex as possible (it does not support unlimited tree depth).

OK, I see.

@IncubatorShokuhou
Contributor Author

IncubatorShokuhou commented Jul 14, 2021

@xuyxu I think I have figured out the cause of this error.
In https://github.com/pandas-dev/pandas/blob/64559124a4de977e1d5cd09e6d80fbb110d3a6ea/pandas/core/dtypes/cast.py#L112 , pandas first checks whether the object has a __len__ method. If it does, pandas tries to transform this list-like object (i.e. CascadeForestRegressor()) into a 1-dimensional numpy array of object dtype via construct_1d_object_array_from_listlike in https://github.com/pandas-dev/pandas/blob/64559124a4de977e1d5cd09e6d80fbb110d3a6ea/pandas/core/dtypes/cast.py#L1970 .
So the error actually occurs in

result = np.empty(1, dtype="object")
result[:] = [CascadeForestRegressor()]

and when numpy tries to put CascadeForestRegressor() into the empty object array, __getitem__ in https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#L540 is called, and the error occurs.

Actually, the error can be reproduced more directly in another way:

# basic example
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from deepforest import CascadeForestClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = CascadeForestClassifier(random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print("\nTesting Accuracy: {:.3f} %".format(acc))

# now the model has 2 layers. Iterate over it.
for i,j in enumerate(model):
    print("i = ")
    print(i)
    print("j = ")
    print(j)
    print("ok")

and here is the error:

i = 
0
j = 
ClassificationCascadeLayer(buffer=<deepforest._io.Buffer object at 0x7efa7e4f5fd0>,
                           criterion='gini', layer_idx=0, n_estimators=4,
                           n_outputs=10, random_state=1)
ok
i = 
1
j = 
ClassificationCascadeLayer(buffer=<deepforest._io.Buffer object at 0x7efa7e4f5fd0>,
                           criterion='gini', layer_idx=1, n_estimators=4,
                           n_outputs=10, random_state=1)
ok
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-29-c9f04ba43562> in <module>
----> 1 for i,j in enumerate(model):
      2     print("i = ")
      3     print(i)
      4     print("j = ")
      5     print(j)

/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/deepforest/cascade.py in __getitem__(self, index)
    518 
    519     def __getitem__(self, index):
--> 520         return self._get_layer(index)
    521 
    522     def _get_n_output(self, y):

/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/deepforest/cascade.py in _get_layer(self, layer_idx)
    561             logger.debug("self.n_layers_ = "+ str(self.n_layers_))
    562             logger.debug("layer_idx = "+ str(layer_idx))
--> 563             raise ValueError(msg.format(self.n_layers_ - 1, layer_idx))
    564 
    565         layer_key = "layer_{}".format(layer_idx)

ValueError: The layer index should be in the range [0, 1], but got 2 instead.

Then I noticed that https://docs.python.org/zh-cn/3/reference/datamodel.html#object.__getitem__ says:

Note for loops expect that an IndexError will be raised for illegal indexes to allow proper detection of the end of the sequence.

That's it! Deep-Forest raises a ValueError instead of an IndexError by mistake. When I changed it, everything worked!
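The effect of the one-line fix can be sketched with two minimal hypothetical classes that model _get_layer's bounds check before and after the change:

```python
class BuggyModel:
    """Bounds check raises ValueError, as in the original _get_layer."""
    def __len__(self):
        return 2
    def __getitem__(self, idx):
        if not 0 <= idx < 2:
            raise ValueError("layer index out of range")
        return "layer_{}".format(idx)

class FixedModel:
    """Same check, but raising IndexError, which for loops treat as the
    end-of-sequence signal."""
    def __len__(self):
        return 2
    def __getitem__(self, idx):
        if not 0 <= idx < 2:
            raise IndexError("layer index out of range")
        return "layer_{}".format(idx)

# Neither class defines __iter__, so iteration falls back to the legacy
# protocol: call obj[0], obj[1], obj[2], ... until IndexError is raised.
print(list(FixedModel()))      # ['layer_0', 'layer_1']
try:
    list(BuggyModel())
except ValueError as exc:
    print("iteration broke:", exc)
```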

@IncubatorShokuhou
Contributor Author

I am going to create a PR and fix this error ASAP.
