validate() for rms::ols: Error in lsfit(x, y) : only 0 cases, but 2 variables #52

Deleetdk · 2017-12-14T07:42:27Z

I get a strange sounding error when trying to use validate() on a fitted ols:

Error in lsfit(x, y) : only 0 cases, but 2 variables

The dataset has n=1890 with about 400 predictors in the model. Almost all the predictors are dichotomous dummies indicating whether some regex pattern matched a name or not. Some of these only have a few true cases (but at least 10). This is a preliminary fit before I am doing some penalization to improve the model fit and final predictors (done with LASSO in glmnet). However, I wanted to validate the validity of the initial model. My guess is that the error occurs due to the resampling ending up with no cases for a given variable in the training set, which causes it to fail to fit / not be able to use that variable in the prediction in the test set.

For a reproducible example, here's a similar dataset based on iris:

#sim some data
iris2 = iris
set.seed(1)
iris2$letter1 = sample(letters, size = 150, replace = T)
iris2$letter2 = sample(letters, size = 150, replace = T)
iris2$letter3 = sample(letters, size = 150, replace = T)

#fit
(fit = rms::ols(Sepal.Width ~ letter1 + letter2 + letter3 + Petal.Width, Petal.Length, data = iris2, x = T, y = T))
validate(fit)

Gives:

Error in lsfit(x, y) : only 0 cases, but 2 variables
In addition: Warning message:
In lsfit(x, y) : 150 missing values deleted

The dataset has no missing data.

In my own simple cross-validation implementation discussed here, I got around this issue by simply ignoring runs that produce errors. See this question: https://stats.stackexchange.com/questions/213837/k-fold-cross-validation-nominal-predictor-level-appears-in-the-test-data-but-no Maybe this too should be done for rms?

The text was updated successfully, but these errors were encountered:

harrelfe · 2017-12-15T02:01:22Z

Thanks for the report. There was a bug for ols for validate and calibrate where singular fits were reporting NAs instead of setting fail=TRUE so that that sample would be ignore. This is fixed for the next release.

Deleetdk · 2017-12-16T22:00:45Z

Updating to the Github version, validate no longer throws as error, but it gives useless output for my use case as all 40 runs failed:

> validate(ols_fit)

Divergence or singularity in 40 samples
          index.orig training test optimism index.corrected n
R-square       0.572      NaN  NaN      NaN             NaN 0
MSE            0.425      NaN  NaN      NaN             NaN 0
g              0.000      NaN  NaN      NaN             NaN 0
Intercept      0.000      NaN  NaN      NaN             NaN 0
Slope          1.000      NaN  NaN      NaN             NaN 0

In the iris example case, it is also almost useless. Despite 40 runs, only 2 completed:

> validate(fit)

Divergence or singularity in 38 samples
          index.orig training   test optimism index.corrected n
R-square      0.5504   0.8728 -0.931    1.804         -1.2536 2
MSE           0.0848   0.0234  0.364   -0.341          0.4258 2
g             0.3504   0.4573  0.191    0.266          0.0845 2
Intercept     0.0000   0.0000  2.177   -2.177          2.1766 2
Slope         1.0000   1.0000  0.289    0.711          0.2886 2

My guess is the same as before: one has to use special sampling to avoid the issue. As someone on Cross Validated suggested:

You could look into stratified sampling, i.e. constraining your train/test splits so that they have (approximately) the same relative frequencies for your predictor levels.

However, I think it worth considering whether the current behavior is actually wanted: So random splitting with non-negligible frequency results in sets that do not cover all predictor levels. Can you consider such a set representative for whatever the application is?
I've been working with such small sample sizes and went for stratified splitting. But I insist that thinking hard about the data and the consequences of working with such small samples is at least as necessary as fixing the pure computational error.

harrelfe · 2017-12-16T22:03:44Z

The behavior you saw is the intended behavior when the sample size does not support a large number of parameters. You'll need to reduce the number of parameters in the model.

Deleetdk · 2018-01-09T03:45:29Z

How do you recommend that I validate models that contain a large number of logical predictors without running into this issue?

harrelfe · 2018-01-09T04:19:25Z

You have too many parameters in the model.

mirhassan121 · 2024-09-16T04:57:06Z

In addition: Warning messages:
1: In lsfit(x, y) : 16 missing values deleted
2: In lsfit(x, y) : 16 missing values deleted
how solve this issue

Deleetdk mentioned this issue Dec 16, 2017

Error in htmlSpecial("part") : illegal character name:part #53

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

validate() for rms::ols: Error in lsfit(x, y) : only 0 cases, but 2 variables #52

validate() for rms::ols: Error in lsfit(x, y) : only 0 cases, but 2 variables #52

Deleetdk commented Dec 14, 2017

harrelfe commented Dec 15, 2017

Deleetdk commented Dec 16, 2017

harrelfe commented Dec 16, 2017

Deleetdk commented Jan 9, 2018

harrelfe commented Jan 9, 2018

mirhassan121 commented Sep 16, 2024

validate() for rms::ols: Error in lsfit(x, y) : only 0 cases, but 2 variables #52

validate() for rms::ols: Error in lsfit(x, y) : only 0 cases, but 2 variables #52

Comments

Deleetdk commented Dec 14, 2017

harrelfe commented Dec 15, 2017

Deleetdk commented Dec 16, 2017

harrelfe commented Dec 16, 2017

Deleetdk commented Jan 9, 2018

harrelfe commented Jan 9, 2018

mirhassan121 commented Sep 16, 2024